June, 2024

Data Engineering Projects

Start Data Engineering

1. Introduction
2. Run Data Pipelines
2.1. Run on codespaces
2.2. Run locally
3. Projects
3.1. Projects from least to most complex
3.2. Batch pipelines
3.3. Stream pipelines
3.4. Event-driven pipelines
3.5. LLM RAG pipelines
4. Conclusion

1. Introduction

Whether you are new to data engineering or have been in the data field for a few years, one of the most challenging parts of learning new frameworks is setting them up!

What I’ve Learned After A Decade Of Data Engineering

Confessions of a Data Guy

After 10 years of Data Engineering work, I think it’s time to hang up the proverbial hat and ride off into the sunset, never to be seen again. I wish. Everything has changed in 10 years, yet nothing has changed in 10 years. How is that even possible? Sometimes I wonder if I’ve learned anything […]

Stitching Together Enterprise Analytics With Microsoft Fabric

Data Engineering Podcast

Summary: Data lakehouse architectures have been gaining significant adoption. To accelerate adoption in the enterprise, Microsoft has created the Fabric platform, based on their OneLake architecture. In this episode, Dipti Borkar shares her experiences working on the product team at Fabric and explains the various use cases for the Fabric service. Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex.

Databricks, Snowflake and the future

Christophe Blefari

Every year, the competition between Snowflake and Databricks intensifies, using their annual conferences as a platform for demonstrating their power. This year, the Snowflake Summit was held in San Francisco from June 2 to 5, while the Databricks Data+AI Summit took place 5 days later, from June 10 to 13, also in San Francisco.

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

Unpacking the Latest Streaming Announcements: A Comprehensive Analysis

Jesse Anderson

This video covers the latest announcements from StreamNative, Confluent, and WarpStream. We discuss communication protocols, how they’re used, and what they mean for you. We also discuss the various systems using Kafka’s protocol. Finally, we discuss the announcements about writing to Iceberg and Delta Lake directly from the broker and what that means for costs and operational ease.

Infoshare 2024 - Retrospective

Waitingforcode

Last May, I gave a talk about stream processing fallacies at Infoshare in Gdansk. Besides this speaking experience, I was also, and maybe among others, an attendee who enjoyed several talks in the software and data engineering areas. I'm writing this blog post to remember them and, why not, share the knowledge with you!

More Trending

OpenAI Acquires Rockset

Rockset

I’m excited to share that OpenAI has completed the acquisition of Rockset. We are thrilled to join the OpenAI team and bring our technology and expertise to building safe and beneficial AGI. From the start, our vision at Rockset was to fundamentally transform the way data-driven applications were built. We developed our search and analytics database, taking full advantage of the cloud, to eliminate the complexity inherent in the data infrastructure needed for these apps.

Improve Data Quality Through Engineering Rigor And Business Engagement With Synq

Data Engineering Podcast

Summary: This episode features an insightful conversation with Petr Janda, the CEO and founder of Synq. Petr shares his journey from being an engineer to founding Synq, emphasizing the importance of treating data systems with the same rigor as engineering systems. He discusses the challenges and solutions in data reliability, including the need for transparency and ownership in data systems.

Deploying Machine Learning Models: A Step-by-Step Tutorial

KDnuggets

Model deployment is the process of integrating trained models into practical applications. This includes defining the necessary environment, specifying how input data is introduced into the model and the output produced, and the capacity to analyze new data and provide relevant predictions or categorizations.

How FactSet Implemented an Enterprise Generative AI Platform with Databricks and MLflow

databricks

“FactSet’s mission is to empower clients to make data-driven decisions and supercharge their workflows and productivity. To deliver AI-driven solutions across our entire.

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

How Meta trains large language models at scale

Engineering at Meta

As we continue to focus our AI research and development on solving increasingly complex problems, one of the most significant and challenging shifts we’ve experienced is the sheer scale of computation required to train large language models (LLMs). Traditionally, our AI model training has involved training a massive number of models that required a comparatively smaller number of GPUs.

Robinhood to Acquire Bitstamp

Robinhood

This acquisition will bring Bitstamp’s globally-scaled crypto exchange to Robinhood, with retail and institutional customers across the EU, UK, US and Asia. This strategic combination better positions Robinhood to expand outside of the US and will bring a trusted and reputable institutional business to Robinhood. Expected to close in the first half of 2025, subject to customary closing conditions, including regulatory approvals.

Introducing Polaris Catalog: An Open Source Catalog for Apache Iceberg

Snowflake

Open source file and table formats have garnered much interest in the data industry because of their potential for interoperability: unlocking the ability for many technologies to safely operate over a single copy of data. Greater interoperability not only reduces the complexity and costs associated with using many tools and processing engines in parallel, but also reduces potential risks associated with vendor lock-in.

Being Data Driven At Stripe With Trino And Iceberg

Data Engineering Podcast

Summary: Stripe is a company that relies on data to power their products and business. To support that functionality, they have invested in Trino and Iceberg for their analytical workloads. In this episode, Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform.

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Creating AI-Driven Solutions: Understanding Large Language Models

KDnuggets

Understanding LLMs is pivotal in unlocking the full potential of AI-driven solutions across various domains. As we navigate the process of building AI-driven solutions, it is essential to approach the development and deployment of LLMs with a focus on responsible AI practices.

Mosaic AI: Build and deploy production-quality Compound AI Systems

databricks

Over the last year, we have seen a surge of commercial and open-source foundation models showing strong reasoning abilities on general knowledge tasks.

Maintaining large-scale AI capacity at Meta

Engineering at Meta

Meta is currently operating many data centers with GPU training clusters across the world. Our data centers are the backbone of our operations, meticulously designed to support the scaling demands of compute and storage. A year ago, however, as the industry reached a critical inflection point due to the rise of artificial intelligence (AI), we recognized that to lead in the generative AI space we’d need to transform our fleet.

Generative AI vs. Predictive AI: Understanding the Differences

Edureka

Is AI taking over the world? Umm, not yet, at least. However, according to a recently published report, almost 35% of global companies report using AI to optimize their business. In this article, we will take a closer look at two of the most talked-about and widely used AI technologies of 2024: generative AI and predictive AI. Table of Contents: Generative AI vs Predictive AI – Comparison Table; Generative AI 101: A Revolutionary Cocktail of Technology and Art; How Does Generative AI

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Open, Interoperable Storage with Iceberg Tables, Now Generally Available

Snowflake

Thousands of customers have worked with Snowflake to cost-effectively build a secure data foundation as they look to solve a growing variety of business problems with more data. Increasingly customers are looking to expand that powerful foundation to a broader set of data across their enterprise. Snowflake is now making it even easier for customers to bring the platform’s usability, performance, governance and many workloads to more data with Iceberg tables (now generally available), unlocking f

X-Ray Vision For Your Flink Stream Processing With Datorios

Data Engineering Podcast

Summary: Streaming data processing enables new categories of data products and analytics. Unfortunately, reasoning about stream processing engines is complex and lacks sufficient tooling. To address this shortcoming, Datorios created an observability platform for Flink that brings visibility to the internals of this popular stream processing system. In this episode, Ronen Korman and Stav Elkayam discuss how the increased understanding provided by purpose-built observability improves the usefulness

10 GitHub Repositories to Master SQL

KDnuggets

Learn SQL and databases through free courses, tutorials, tools, guides, books, practice exercises, projects, awesome lists, and other resources.

Introducing Databricks LakeFlow: A unified, intelligent solution for data engineering

databricks

Today, we are excited to announce Databricks LakeFlow, a new solution that contains everything you need to build and operate production data pipelines.

Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Data Engineering Weekly #177

Data Engineering Weekly

Experience Enterprise-Grade Apache Airflow: Astro augments Airflow with enterprise-grade features to enhance productivity, meet scalability and availability demands across your data pipelines, and more.

Redpoint: The InfraRed Report. Macroeconomic slowness has increased the focus on reducing infrastructure spending.

Databricks Follows Cloudera by Adopting Iceberg, While Snowflake Mulls Open Source Approach

Cloudera

A constant flow of breaking news from the data lakehouse space is making notable tech headlines this week. On Tuesday, Databricks announced that it will acquire Tabular, a data management company founded by the creators of Apache Iceberg, Ryan Blue, Daniel Weeks, and Jason Reid. The deal was for an unconfirmed sum, but some reports put the amount between $1B and $2B (allegedly outbidding Snowflake).

Introducing Snowpark pandas API: Run Distributed pandas at Scale in Snowflake

Snowflake

Python’s popularity has grown significantly, quickly becoming the preferred language for development across machine learning, application development, pipelines and more. At Snowflake we are deeply committed to delivering a best-in-class platform for Python developers. In line with this commitment, we’re thrilled to announce the public preview support of Snowpark pandas API, enabling seamless execution of distributed pandas at scale in Snowflake.

Practical First Steps In Data Governance For Long Term Success

Data Engineering Podcast

Summary: Modern businesses aspire to be data driven, and technologists enjoy working through the challenge of building data systems to support that goal. Data governance is the binding force between these two parts of the organization. Nicola Askham found her way into data governance by accident, and stayed because of the benefit that she was able to provide by serving as a bridge between the technology and business.

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

5 Tips to Step Up Your Data Science Game Right Away

KDnuggets

This article intends to provide practical advice for becoming a better data scientist by focusing on five different areas of proficiency. Whether you are starting out, or looking to get grounded after years as a practitioner, jump in and elevate your game.

Databricks + Tabular

databricks

We are excited to announce that we have agreed to acquire Tabular, Inc, a data management company founded by Ryan Blue, Daniel Weeks.

AI-Enhanced User Experiences in ArcGIS Pro 3.3

ArcGIS

Learn about the new AI-enhanced user experiences for geoprocessing in ArcGIS Pro 3.3, including semantic search and tool suggestions.

Cloudera Unveils Plans for Annual Pride Celebration in Cork

Cloudera

Pride Month is underway and we at Cloudera are looking forward to joining the global celebration of diversity, equity and the ongoing effort for LGBTQ+ (Lesbian, Gay, Bisexual, Transgender, Queer/Questioning) rights and recognition. Pride Month serves as a reminder that the fight for equality and equity for members of the LGBTQ+ community is not over.

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus. As the complexity of the task increases, the number of use cases and corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.