2025

Are LLMs making StackOverflow irrelevant?

The Pragmatic Engineer

Hi, this is Gergely with a bonus issue of the Pragmatic Engineer Newsletter. In every issue, I cover topics related to Big Tech and startups through the lens of engineering managers and senior engineers. This article is one out of five sections from The Pulse #119. Full subscribers received this issue a week and a half ago. To get articles like this in your inbox, subscribe here.

Data News — Week 25.02

Christophe Blefari

Happy new year ✨ I wish you the best for 2025. There are multiple ways to start a new year: with new projects, new ideas, new resolutions, or by just carrying on with the same music. I hope you will enjoy 2025. The Data News are here to stay; the format might vary during the year, but here we are for another year. Thank you so much for your support through the years.

Data Integrity for AI: What’s Old is New Again

Precisely

Artificial Intelligence (AI) is all the rage, and rightly so. By now most of us have experienced how Gen AI and the LLMs (large language models) that fuel it are primed to transform the way we create, research, collaborate, engage, and much more. Yet along with the AI hype and excitement come very appropriate sanity checks asking whether AI is ready for prime time.

How to ensure consistent metrics in your warehouse

Start Data Engineering

Outline: 1. Introduction; 2. Centralize Metric Definitions in Code (Option A: Semantic Layer for On-the-Fly Queries; Option B: Pre-Aggregated Tables for Consumers); 3. Conclusion & Recap; 4. Required Reading. 1. Introduction: If you’ve worked on a data team, you’ve likely encountered situations where multiple teams define metrics in slightly different ways, leaving you to untangle why discrepancies exist.

How to Achieve High-Accuracy Results When Using LLMs

Speaker: Ben Epstein, Stealth Founder & CTO | Tony Karrer, Founder & CTO, Aggregage

When tasked with building a fundamentally new product line with deeper insights than previously achievable for a high-value client, Ben Epstein and his team faced a significant challenge: how to harness LLMs to produce consistent, high-accuracy outputs at scale. In this new session, Ben will share how he and his team engineered a system (based on proven software engineering approaches) that employs reproducible test variations (via temperature 0 and fixed seeds), and enables non-LLM evaluation m

Scaling Beyond Postgres: How to Choose a Real-Time Analytical Database

Simon Späti

Many data engineers and analysts start their journey with Postgres. Postgres is powerful, reliable, and flexible enough to handle both transactional and basic analytical workloads. It’s the Swiss Army knife of databases, and for many applications, it’s more than sufficient. But data volumes grow, analytical demands become more complex, and Postgres stops being enough.

How to Use Apache Iceberg Tables?

Analytics Vidhya

Apache Iceberg is a modern table format designed to overcome the limitations of traditional Hive tables, offering improved performance, consistency, and scalability. In this article, we will explore the evolution of Iceberg, its key features like ACID transactions, partition evolution, and time travel, and how it integrates with modern data lakes. We’ll also dive into […]

More Trending

How Meta discovers data flows via lineage at scale

Engineering at Meta

Data lineage is an instrumental part of Meta’s Privacy Aware Infrastructure (PAI) initiative, a suite of technologies that efficiently protect user privacy. It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Meta’s systems. This allows us to verify that our users’ everyday interactions are protected across our family of apps, such as their religious views in the Facebook Dating app, the example we’ll walk through in this

Predictions 2025: AI As Cybersecurity Tool and Target

Snowflake

Though AI is (still) the hottest technology topic, it’s not the overriding issue for enterprise security in 2025. Advanced AI will open up new attack vectors and also deliver new tools for protecting an organization’s data. But the underlying challenge is the sheer quantity of data that overworked cybersecurity teams face as they try to answer basic questions such as, “Are we under attack?”

Testing and Development for Databricks Environment and Code.

Confessions of a Data Guy

Every once in a great while, the question comes up: “How do I test my Databricks codebase?” It’s a fair question, and if you’re new to testing your code, it can seem a little overwhelming on the surface. However, I assure you the opposite is the case. Testing your Databricks codebase is no different than […]

What Is PDFMiner And Should You Use It – How To Extract Data From PDFs

Seattle Data Guy

PDF files are one of the most popular file formats today. Because they can preserve the visual layout of documents and are compatible with a wide range of devices and operating systems, PDFs are used for everything from business forms and educational material to creative designs. However, PDF files also present multiple challenges when it…

Apache Airflow® 101 Essential Tips for Beginners

Apache Airflow® is the open-source standard to manage workflows as code. It is a versatile tool used in companies across the world from agile startups to tech giants to flagship enterprises across all industries. Due to its widespread adoption, Airflow knowledge is paramount to success in the field of data engineering.

Unapologetically Technical Episode 17 – Semih Salihoglu

Jesse Anderson

In this episode of Unapologetically Technical, I interview Semih Salihoglu, Associate Professor at the University of Waterloo and co-founder and CEO of Kuzu. Semih is a researcher and entrepreneur with a background in distributed systems and databases. He shares his journey from a small city in Turkey to the hallowed halls of Yale University, where he studied computer science and economics.

LLMs Don’t Know What They Don’t Know—And That’s a Problem by Colin Eberhardt

Scott Logic

LLMs are not just limited by hallucinations: they fundamentally lack awareness of their own capabilities, making them overconfident in executing tasks they don’t fully understand. While vibe coding embraces AI’s ability to generate quick solutions, true progress lies in models that can acknowledge ambiguity, seek clarification, and recognise when they are out of their depth.

Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data

Towards Data Science

Building more efficient AI. TL;DR: Data-centric AI can create more efficient and accurate models. I experimented with data pruning on MNIST to classify handwritten digits. (Figure: best runs for furthest-from-centroid selection compared to the full dataset. Image by author.) What if I told you that using just 50% of your training data could achieve better results than using the full dataset?

The Alarming Cost of Poor Data Quality

Monte Carlo

When data engineers tell scary stories around a campfire, it’s usually a cautionary tale about the cost of poor data quality. Data downtime can occur suddenly at any time, and often not when or where you’re looking for it. And its cost is the scariest part of all. But just how much can data downtime actually cost your business? In this article, we’ll learn from a real-life data downtime horror story to understand the cost of bad data, its impacts, and how to prevent it.

Apache Airflow® Best Practices: DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

A Beginner’s Guide to Geospatial with DuckDB

Simon Späti

Geospatial data is everywhere in modern analytics. Consider this scenario: you’re a data analyst at a growing restaurant chain, and your CEO asks, “Where should we open our next location?” This seemingly simple question requires analyzing competitor locations, population density, traffic patterns, and demographics, all spatial data. Traditionally, answering this question would require expensive GIS (Geographic Information Systems) software or complex database setups.

Apache Iceberg vs Delta Lake vs Hudi: Best Open Table Format for AI/ML Workloads

Analytics Vidhya

If you’re working with AI/ML workloads (like me) and trying to figure out which data format to choose, this post is for you. Whether you’re a student, analyst, or engineer, knowing the differences between Apache Iceberg, Delta Lake, and Apache Hudi can save you a ton of headaches when it comes to performance, scalability, and real-time […]

Event time skew and global watermark in Apache Spark Structured Streaming

Waitingforcode

A few months ago I wrote a blog post about event skew and how dangerous it is for a stateful streaming job. Since it was a high-level explanation, I didn't cover Apache Spark Structured Streaming deeply at that moment. Now the watermark topic is back to my learning backlog and it's a good opportunity to return to the event skew topic and see the dangers it brings for Structured Streaming stateful jobs.

Strobelight: A profiling service built on open source technology

Engineering at Meta

We’re sharing details about Strobelight, Meta’s profiling orchestrator. Strobelight combines several technologies, many open source, into a single service that helps engineers at Meta improve efficiency and utilization across our fleet. Using Strobelight, we’ve seen significant efficiency wins, including one that has resulted in an estimated 15,000 servers’ worth of annual capacity savings.

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Enhancing Neural Network Training at Yelp: Achieving 1,400x Speedup with WideAndDeep

Yelp Engineering

At Yelp, we encountered challenges that prompted us to enhance the training time of our ad-revenue generating models, which use a Wide and Deep Neural Network architecture for predicting ad click-through rates (pCTR). These models handle large tabular datasets with small parameter spaces, requiring innovative data solutions. This blog post delves into our journey of optimizing training time using TensorFlow and Horovod, along with the development of ArrowStreamServer, our in-house library for lo

Building a Fast, Light, and CHEAP Lake House with DuckDB, Delta Lake, and AWS Lambda

Confessions of a Data Guy

Building fun things is a real part of Data Engineering. Using your creative side when building a Lake House is possible, and using tools that are outside the normal box can sometimes be preferable. Check out this video where I dive into how I build just such a Lake House using Modern Data Stack tools like […]

Continuously Improving Developer Productivity at Snowflake

Snowflake

People often ask me, “Why did you join Snowflake, and why did you choose to work on developer productivity?” I joined Snowflake to learn from world-class engineers and be part of the highly collaborative culture. These have been the secret sauce to Snowflake’s rocket-ship growth. Snowflake was embarking on a remarkable transformation of developer productivity, and I had to jump on the rocket ship as it was taking off!

Mastering Multi-Cloud with Cloudera: Strategic Data & AI Deployments Across Clouds

Cloudera

In today’s dynamic digital landscape, multi-cloud strategies have become vital for organizations aiming to leverage the best of both cloud and on-premises environments. As enterprises navigate complex data-driven transformations, hybrid and multi-cloud models offer unmatched flexibility and resilience. Here’s a deep dive into why and how enterprises master multi-cloud deployments to enhance their data and AI initiatives.

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Title Launch Observability at Netflix Scale

Netflix Tech

Part 2: Navigating Ambiguity. By: Varun Khaitan. With special thanks to my stunning colleagues: Mallika Rao, Esmir Mesic, Hugo Marques. Building on the foundation laid in Part 1, where we explored the “what” behind the challenges of title launch observability at Netflix, this post shifts focus to the “how.” How do we ensure every title launches seamlessly and remains discoverable by the right audience?

The Three Levels of SQL Comprehension: What they are and why you need to know about them

dbt Developer Hub

Ever since dbt Labs acquired SDF Labs last week, I've been head-down diving into their technology and making sense of it all. The main thing I knew going in was "SDF understands SQL". It's a nice pithy quote, but the specifics are fascinating. For the next era of Analytics Engineering to be as transformative as the last, dbt needs to move beyond being a string preprocessor and into fully comprehending SQL.

Beyond Kafka: Conversation with Jark Wu on Fluss - Streaming Storage for Real-Time Analytics

Data Engineering Weekly

Fluss is a compelling new project in the realm of real-time data processing. I spoke with Jark Wu, who leads the Fluss and Flink SQL team at Alibaba Cloud, to understand its origins and potential. Jark is a key figure in the Apache Flink community, known for his work in building Flink SQL from the ground up and creating Flink CDC and Fluss. You can read the Q&A version of the conversation here, and don’t forget to listen to the podcast.

No Python, No SQL Templates, No YAML: Why Your Open Source Data Quality Tool Should Generate 80% Of Your Data Quality Tests Automatically

DataKitchen

As a data engineer, ensuring data quality is both essential and overwhelming. The sheer volume of tables, the complexity of the data usage, and the volume of work make manual test writing an impossible task to get done.

Prepare Now: 2025’s Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Getting Started with Apache Arrow

Analytics Vidhya

Data is at the core of everything, from business decisions to machine learning. But processing large-scale data across different systems is often slow. Constant format conversions add processing time and memory overhead. Traditional row-based storage formats struggle to keep up with modern analytics. This leads to slower computations, higher memory usage, and performance bottlenecks.

Data Integration for AI: Top Use Cases and Steps for Success

Precisely

Key Takeaways: Trusted data is critical for AI success. Data integration ensures your AI initiatives are fueled by complete, relevant, and real-time enterprise data, minimizing errors and unreliable outcomes that could harm your business. Data integration solves key business challenges. It enables faster decision-making, boosts efficiency, and reduces costs by providing self-service access to data for AI models.

A case for QLC SSDs in the data center

Engineering at Meta

The growth of data and need for increased power efficiency are leading to innovative storage solutions. HDDs have been growing in density, but not performance, and TLC flash remains at a price point that is restrictive for scaling. QLC technology addresses these challenges by forming a middle tier between HDDs and TLC SSDs. QLC provides higher density, improved power efficiency, and better cost than existing TLC SSDs.

Establishing a Large Scale Learned Retrieval System at Pinterest

Pinterest Engineering

Bowen Deng | Machine Learning Engineer, Homefeed Candidate Generation; Zhibo Fan | Machine Learning Engineer, Homefeed Candidate Generation; Dafang He | Machine Learning Engineer, Homefeed Relevance; Ying Huang | Machine Learning Engineer, Curation; Raymond Hsu | Engineering Manager, Homefeed CG Product Enablement; James Li | Engineering Manager, Homefeed Candidate Generation; Dylan Wang | Director, Homefeed Relevance; Jay Adams | Principal Engineer, Pinner Curation & Growth Introduction At P

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.