Analysis of Confluent Buying Immerok

Jesse Anderson

I started a Twitter thread with some of my initial thoughts, but I want to write a post giving more analysis and opinions. I think it’s quite telling that even the announcement doesn’t get ksqlDB’s name right. Since Kafka Streams is part of the Apache project, I don’t see it going away as quickly.

Build an Open Data Lakehouse with Iceberg Tables, Now in Public Preview

Snowflake

Apache Iceberg’s ecosystem of diverse adopters, contributors, and commercial support continues to grow, establishing Iceberg as the industry-standard table format for an open data lakehouse architecture. If you already manage Iceberg tables in an AWS Glue catalog, the Glue catalog integration provides an easy way to start querying those tables with Snowflake.
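As a minimal, hypothetical sketch of what that can look like from client code, the Python snippet below queries an Iceberg table through the standard Snowflake Python connector once the Glue catalog integration and the Iceberg table have been set up in Snowflake; the connection parameters and the table name ORDERS_ICEBERG are placeholders, not details from the announcement.

```python
# Sketch: query a Snowflake-managed view of an Iceberg table registered via a
# Glue catalog integration. Requires: pip install snowflake-connector-python
import snowflake.connector

# Placeholder credentials and objects; substitute your own account details.
conn = snowflake.connector.connect(
    account="my_account",      # hypothetical Snowflake account identifier
    user="my_user",
    password="my_password",
    warehouse="ANALYTICS_WH",
    database="LAKEHOUSE_DB",
    schema="ICEBERG",
)

try:
    cur = conn.cursor()
    # ORDERS_ICEBERG is a hypothetical Iceberg table already exposed through
    # the Glue catalog integration; it is queried like any other table.
    cur.execute("SELECT order_id, amount FROM ORDERS_ICEBERG LIMIT 10")
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()
```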

Getting Started With Cloudera Open Data Lakehouse on Private Cloud

Cloudera

Cloudera recently released a fully featured Open Data Lakehouse, powered by Apache Iceberg, in the private cloud, in addition to the Open Data Lakehouse that has been available in the public cloud since last year. Please note that you can also leverage Flink and SQL Stream Builder in CSA 1.11 and Cloudera Flow Management 2.1.6.

Fraud Detection With Cloudera Stream Processing Part 2: Real-Time Streaming Analytics

Cloudera

In part 1 of this blog, we discussed how Cloudera DataFlow for the Public Cloud (CDF-PC), the universal data distribution service powered by Apache NiFi, can make it easy to acquire data from wherever it originates and move it efficiently so it is available to other applications in a streaming fashion. Data decays!

Brief History of Data Engineering

Jesse Anderson

Doug Cutting took those papers and created Apache Hadoop in 2005. Cloudera was started in 2008, and Hortonworks started in 2011. Hadoop was hard to program, and Apache Hive came along in 2010 to add SQL. Apache Pig came along in 2008 too, but it never saw as much adoption. They eventually merged in 2012.

Druid Deprecation and ClickHouse Adoption at Lyft

Lyft Engineering

At Lyft, we have used systems like ClickHouse and Apache Druid for near-real-time and sub-second analytics. In this blog post, we explain how Druid has been used at Lyft and what led us to adopt ClickHouse for our sub-second analytics system. Written by Ritesh Varyani and Jeana Choi at Lyft.

Data Engineering Weekly #151

Data Engineering Weekly

GitHub wrote an excellent blog post capturing the current state of LLM integration architecture. I found this GitHub tutorial from Microsoft to be an excellent resource for getting started with Gen-AI if you’re beginning your journey to understand the landscape. Lackluster AI/ML results often stem from poor data quality.