Skip to content

Data News — Week 24.20

Data News #24.20 — Big edition, 5000 members ❤️, launching Qrators to search in videos, Data Council, OpenAI and Google I/O stuff and data eng stuff.

Christophe Blefari
Christophe Blefari
6 min read
photography of spot light turned on
Lights on (credits)

Hello you. The sun is out, the days are getting longer and Data News is still here. Next week marks 3 years of this newsletter/blog (yay 🎉 ). It'll be a time for looking back, reflecting and celebrating, but next week. This week, we reached 5000 members.

Yes, 5000 of you read my content periodically. Just thank you ❤️.

In the recent days I've been working on a new side project. What if you could search in video content and get the exact timestamp of what you're looking for?

Let me introduce an application of this on the Data Council 2024 80 videos.

Data Council 2024 ✨

Data Council Austin is according to me one of the best conference when it comes to think about the future of data. Every year the talk that are given at DC are always full of quality content. There is a main drawback of this which is: it 80 videos of ~30 minutes and not everyone have the time to watch everything or search among the videos.

So I developed an app which allows you to search for words in the Data Council video playlist and we've curated highlights with Julien in order for you to watch only what we've curated.

It's available on qrators (can be pronounced curators / creators). And it works for the moment only on desktop.

The search is working greatly, for instance you can full-text or exact-term search. For instance Airflow, dbt, backfill, "data mesh" or "SQL Glot". Quotes means an exact-term search.

I'll write another post later about the behind the scene and how this app has been built, but because I'm your humble servant, this app uses DuckDB WASM and requires no backend to work (except a bucket with the data).

Still, I want you to get as always a few takeaways of the conference so here are my favourite talks with a few highlights:

  • Data culture as a product— Abhi already had one of my favourite talk from Data Council 2023 about Metrics tree. Following on from his work on metrics, this time he attempts to give advice on creating a good data culture in order to create a good decision culture in companies. After all, companies need to make decisions, and these decisions need to be informed by data. [highlights]
  • Processing trillions of records at Okta with DuckDB instead of Snowflake — it was one of my most expected talk from the council because a few months ago Jake posted on LinkedIn his team successfully reduced Snowflake billing by hundred of thousands by shifting to DuckDB. In the talk he explained what was the issue with Snowflake and how a multi-engine data stack built on-top of S3 + Lambda drastically reduced dollars spent. [highlights]

I like a few of the others talks, but I'll do a dedicated post for this I think because the Data News is already super dense.

Launching Qrators, place to search for stuff in videos

AI News 🤖

  • OpenAI recents announcements — The company behind ChatGPT announced a few things hyping everyone recently. Especially their GPT-4o (it's 4-o the letter and not 40 the number) model, this model adds new capabilities to ChatGPT around photos, videos and audio. The model can talk or understand what's in a image/video and answer questions about it. They also released a MacOS app that you can call on Option+Space to ask ChatGPT. OpenAI also detailed a bit their model specs and what principles they implemented to put guardrails around answers.
  • Why Microsoft invested in OpenAI in 2019 — Emails explaining why Satya Nadella (CEO) and Kevin Scott (CTO) pushed Microsoft to invest in OpenAI have been made public, and are worth a look. It mainly reads that Microsoft was "several years behind the competition in terms of ML scale" (compared to Google, in search / ML in applications) and that to get there, they needed someone with gigantic ambition, from silicon chips to high-level programming abstractions. And the OpenAI team was someone.
  • OpenAI is offering $10m packages to top AI researchers. There is a paywall I can't say more.
  • AI lobbyists are everywhere now — A bit more politic but with the stakes around AI (money, power, content moderation and generation, privacy, etc.) lobbying around it is through the roof.
  • Google I/O keynote — Google I/O was the response from Google to OpenAI announcement around models. They showcased agents that can help you do more in your favourite Google Apps, then DeepMind showcases new capabilities around image and music processing / generation. But one of the most important announcement was only a few seconds about search that might change forever (paywall can be avoided with a page reader). Google introduced AI overview that will be presented first in search answer pushing traditional results far below.
  • LLMs with Keras — Keras team demoed various workflows around LLMs (Gemma) with Keras.
  • Opt-out to avoid Slack training LLM models on your private data — Slack (acquired by Salesforce) could train their LLM models on your data. Still their answered in the Twitter thread but it's legal stuff I don't understand.
  • HuggingFace releases Idefics2 — An open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. Works with multiple images as well to create stories.
  • Building DoorDash’s product knowledge graph with LLMs — A good graph is like good wine and DoorDash used LLMs capabilities in information extraction to improve their product catalog graph.
man using welding machine
Fusion (credits)

Fast News ⚡️

Food for thought to end (because it's already too long)


See you next week for the anniversary 🎂

Data News

Data Explorer

The hub to explore Data News links

Search and bookmark more than 2500 links

Explore

Christophe Blefari

Staff Data Engineer. I like 🚲, 🪴 and 🎮. I can do everything with data, just ask.

Comments


Related Posts

Members Public

How to build a data team

This article will give you a list of the top resources to follow when building a data team.

Members Public

Data News — Week 24.16

Data News #24.16 — Llama the Third, Mistral probable $5B valuation, structured Gen AI, principal engineers, big data scale to count billions and benchmarks.