Blog - Data Engineering Digest

apache-kafka-pros-cons

Blog

An Engineering Guide to Data Quality - A Data Contract Perspective - Part 2

Data Engineering Weekly

MAY 16, 2023

In the first part of this series, we talked about design patterns for data creation and the pros & cons of each system from the data contract perspective. I won’t bore you with the importance of data quality in the blog. The Fronting Kafka pattern follows a two-cluster approach. Why is Data Quality Expensive?

Engineering

Engineering Kafka Data Pipeline Data Warehouse

Data Engineering Weekly #123

Data Engineering Weekly

MARCH 19, 2023

[link] Uber: Setting Uber’s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi. Uber writes a comprehensive guide on running incremental ETL using Apache Hudi. The blog discusses implementing Type-2 SCD modeling and strategies to generate surrogate keys and bridge tables to handle many-to-many relationships.

Data Engineering

Data Engineering Data Engineer Engineering Media

Webinars

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

How To Get Promoted In Product Management

MORE WEBINARS

Trending Sources

How to configure clients to connect to Apache Kafka Clusters securely – Part 2: LDAP

Cloudera

DECEMBER 10, 2020

In the previous post, we talked about Kerberos authentication and explained how to configure a Kafka client to authenticate using Kerberos credentials. In this post, we look at how to configure a Kafka client to authenticate using LDAP instead of Kerberos. We use the kafka-console-consumer for all the examples below.

Kafka

Kafka Certification Management Accessible

Machine Learning with Python, Jupyter, KSQL and TensorFlow

Confluent

FEBRUARY 6, 2019

The blog posts "How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka" and "Using Apache Kafka to Drive Cutting-Edge Machine Learning" describe the benefits of leveraging the Apache Kafka® ecosystem as a central, scalable and mission-critical nervous system.

Machine Learning

Machine Learning Python Kafka Java

How to Use KSQL Stream Processing and Real-Time Databases to Analyze Streaming Data in Kafka

Rockset

MARCH 19, 2020

Intro: In recent years, Kafka has become synonymous with “streaming,” and with features like Kafka Streams, KSQL, joins, and integrations into sinks like Elasticsearch and Druid, there are more ways than ever to build a real-time analytics application around streaming data in Kafka.

Kafka

Kafka Database Process SQL

How AI may impact software architecture by Andrew Carr

Scott Logic

JUNE 6, 2023

Predictions around this are very hard to make, especially taking into account how fast this field is changing, so it will be interesting to revisit this blog in a couple of years to see how things are. In the rest of this blog post, I wish to consider the viability of this and its potential impact on how software architecture is designed.

Architecture

Architecture Coding Designing Systems

Java vs Python for Data Science in 2023-What's your choice?

ProjectPro

JUNE 18, 2021

This blog aims to answer all questions on how Java vs Python compare for data science and which should be the programming language of your choice for doing data science in 2021. Apache Spark is an open-source analytics engine that is used by data scientists for large-scale data processing.

Java

Java Data Science Python Programming Language

Analytics on DynamoDB: Comparing Elasticsearch, Athena and Spark

Rockset

APRIL 29, 2019

In this blog post I compare options for real-time analytics on DynamoDB (Elasticsearch, Athena, and Spark) in terms of ease of setup, maintenance, query capability, and latency. Separately, a Glue ETL Apache Spark job can scan and dump the contents of any DynamoDB table into S3 in Parquet format.

NoSQL

NoSQL PostgreSQL AWS SQL

Top 30 Machine Learning Skills for ML Engineer in 2024

Knowledge Hut

JANUARY 16, 2024

In this comprehensive blog, we delve into the foundational aspects and intricacies of the machine learning landscape. Apache Kafka: Apache Kafka concepts such as Kafka Streams and KSQL play a major role in pre-processing of data in machine learning. Several programming languages can be used to do this.

Machine Learning

Machine Learning Engineering Programming Language Algorithm

The Good and the Bad of Apache Airflow Pipeline Orchestration

AltexSoft

NOVEMBER 7, 2022

But apparently, things were much more difficult before Apache Airflow appeared. This article covers Airflow’s pros and gives a clue why, despite all its virtues, it’s not a silver bullet. What is Apache Airflow?

PostgreSQL

PostgreSQL Metadata Python MySQL

Data Engineering Weekly #112

Data Engineering Weekly

DECEMBER 18, 2022

The author writes an exciting blog, Modern data stack in a Box!! The author narrates the competitive landscape in the orchestration engine today by comparing some of the pros and cons of Airflow as it stands today. Looking at the test results, the Polars implementation performs much better than Apache Spark.

Data Engineering

Data Engineering Data Engineer Engineering Relational Database

Data Engineering Weekly #111

Data Engineering Weekly

DECEMBER 11, 2022

Grab writes about how it implemented Zero trust infrastructure for Kafka. [link] Lumen: Our journey with Apache Flink (Part 1) - Operation and deployment tips. Lumen shares a few practical tips to run Flink in production, reflecting a few core themes to scale the streaming infrastructure.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Data Engineering Weekly #115

Data Engineering Weekly

JANUARY 22, 2023

Editor’s Note: Update on our blog series. One of the promises I made toward the end of 2022 is to publish more of my thoughts and industry observations on data engineering trends. Data Catalog - A broken promise: A classic blog triggers a few conversations about Data Catalog and its future. Sign up free to test out the tool today.

Data Engineering

Data Engineering Data Engineer Engineering Data Pipeline

DataOps: What Is It, Core Principles, and Tools For Implementation

phData: Data Engineering

JANUARY 3, 2022

These are easier to solve as the pros and cons are much simpler to calculate. Now part of the Apache Foundation, it was originally developed by CollabNet, Inc. Challenges with Source Control Management: While the pros heavily outweigh the cons, it is important to talk about the challenges associated with version control.

IT AWS Software Engineer Software Engineering

20 Best Open Source Big Data Projects to Contribute on GitHub

ProjectPro

NOVEMBER 15, 2021

This blog will walk through the most popular and fascinating open source big data projects. Apache Beam: Apache Beam is an advanced, unified, open-source programming model launched in 2016. You can contribute to the Apache Beam open-source big data project here: [link]

Big Data

Big Data Project Metadata Programming Language

Internet of Things (IoT) and Event Streaming at Scale with Apache Kafka and MQTT

Confluent

OCTOBER 10, 2019

Apache Kafka® and its surrounding ecosystem, which includes Kafka Connect, Kafka Streams, and KSQL, have become the technology of choice for integrating and processing these kinds of datasets. "Microservices, Apache Kafka, and Domain-Driven Design (DDD)" covers this in more detail. Example: Severstal.

Kafka

Kafka Google Cloud Architecture Machine Learning

Data Engineering Weekly #117

Data Engineering Weekly

FEBRUARY 5, 2023

The blog is an excellent overview of problems with floating points and integer data types. The blog walks through how, with a single CLI command, users can create their own Ray cluster with preinstalled ML tools, ready-to-run notebook tutorials, VS Code server for in-browser editing, and SSH access.

Data Engineering

Data Engineering Data Engineer Engineering Food
