Blog, Building, Designing and Metadata - Data Engineering Digest

Level Up Your Data Platform With Active Metadata

Data Engineering Podcast

JUNE 19, 2022

Summary: Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. To level up its value, a new trend of active metadata is being implemented, enabling use cases like keeping BI reports up to date, auto-scaling your warehouses, and automated data governance.

Metadata

Metadata MongoDB Scala MySQL

Building A Data Mesh Platform At PayPal

Data Engineering Podcast

FEBRUARY 26, 2023

Jean-Georges Perrin was tasked with designing a new data platform implementation at PayPal and wound up building a data mesh. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. We feel your pain. It ends up being anything but that. When is a data mesh the wrong choice?

Building

Building Metadata Machine Learning Data Integration

Metadata Management And Integration At LinkedIn With DataHub

Data Engineering Podcast

AUGUST 24, 2020

The key to those solutions is a robust and flexible metadata management system. LinkedIn has gone through several iterations on the most maintainable and scalable approach to metadata, leading them to their current work on DataHub. What were you using at LinkedIn for metadata management prior to the introduction of DataHub?

Metadata

Metadata Management Kafka Data Engineering

Build AI-powered Recommendations with Confluent Cloud for Apache Flink® and Rockset

Rockset

MARCH 18, 2024

In this blog, we’ll discuss how RAG fits into the paradigm of real-time data processing and show an example product recommendation application using both Kafka and Flink on Confluent Cloud together with Rockset. Building a real-time, contextual and trustworthy knowledge base for AI applications revolves around RAG pipelines.

Cloud

Cloud Building Metadata Kafka

Introducing Cloudera DataFlow Designer: Self-service, No-Code Dataflow Design

Cloudera

DECEMBER 9, 2022

Now, we shift focus to the needs of developers and to addressing the challenges they face when building dataflows in the cloud. We’ve observed organizations using more and more data sources and destinations, as well as expecting a more diverse range of developers to build data movement flows. Enabling self-service for developers.

Designing

Designing Coding Google Cloud AWS

Cloudera DataFlow Designer: The Key to Agile Data Pipeline Development

Cloudera

MARCH 14, 2023

We just announced the general availability of Cloudera DataFlow Designer, bringing self-service data flow development to all CDP Public Cloud customers. In our previous DataFlow Designer blog post, we introduced you to the new user interface and highlighted its key capabilities.

Data Pipeline

Data Pipeline Designing Kafka Metadata

Building Real-time Machine Learning Foundations at Lyft

Lyft Engineering

JUNE 28, 2023

On the flip side, there was a substantial appetite to build real-time ML systems from developers at Lyft. In this blog post, we will discuss what we built in support of that goal and some of the lessons we learned along the way. To meet the needs of our customers, we kicked off the Real-time Machine Learning with Streaming initiative.

Machine Learning

Machine Learning Building Metadata Kafka

Building a Control Plane for Lyft’s Shared Development Environment

Lyft Engineering

SEPTEMBER 6, 2023

Our team, the Developer Infrastructure team, aims to build the best tools to enable microservice owners (our “customers”) to reliably and quickly test changes in a local and/or end-to-end environment. Routing overrides metadata: embed metadata in API request headers defining which offloaded deployment the request will get routed to.

Building

Building Metadata Electronics Engineering
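
The routing-override idea in the Lyft excerpt above (metadata in request headers that decides which offloaded deployment serves a request) can be illustrated with a minimal sketch. The header name `x-route-overrides`, its `service=deployment` format, and the upstream hostnames are all assumptions made up for this example; the excerpt does not describe Lyft's actual header or routing table.

```python
# Hypothetical header name and value format: "service=deployment,service=deployment".
OVERRIDE_HEADER = "x-route-overrides"

# Hypothetical default upstreams used when no override is present.
DEFAULT_UPSTREAMS = {
    "rides": "rides.staging.internal",
    "payments": "payments.staging.internal",
}


def parse_overrides(header_value):
    """Parse the override header into a {service: deployment} dict.

    Malformed entries (missing '=', empty service or deployment) are ignored.
    """
    overrides = {}
    for part in header_value.split(","):
        service, sep, deployment = part.strip().partition("=")
        service, deployment = service.strip(), deployment.strip()
        if sep and service and deployment:
            overrides[service] = deployment
    return overrides


def resolve_upstream(service, headers):
    """Return the overridden deployment for `service`, else its default.

    Assumes header names in `headers` are already lower-cased.
    """
    overrides = parse_overrides(headers.get(OVERRIDE_HEADER, ""))
    return overrides.get(service, DEFAULT_UPSTREAMS[service])
```

Because the override travels with the request, only requests carrying the header reach the offloaded deployment; all other traffic keeps using the shared default.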

Building And Managing Data Teams And Data Platforms In Large Organizations With Ashish Mrig

Data Engineering Podcast

JANUARY 23, 2022

In this episode he shares his career journey, the challenges related to management of data professionals, and the platform design that he and his team have built to power analytics at a large company. He also provides some excellent insights into the factors that play into the build vs. buy decision at different organizational sizes.

Building

Building Management Data Pipeline Metadata

Building a Winning Data Quality Strategy: Step by Step

Databand.ai

AUGUST 30, 2023

By Eitan Chazbani. What Is a Data Quality Strategy? This includes defining roles and responsibilities related to managing datasets and setting guidelines for metadata management. This starts with building a strong business case for your data quality strategy.

Building

Building Data Cleanse Data Governance Datasets

Launching the Engineering Blog

Zalando Engineering

JUNE 30, 2020

Our Engineering Blog was launched in June 2020 after the previous tech blog had been on a long break. What customizations we applied to design the blog and the publishing process. Static Site Generator: Our previous tech blog used a CMS which only a limited number of people had access to. It's actively developed.

Engineering

Engineering Bytes AWS Python

Data Engineering Weekly #152

Data Engineering Weekly

DECEMBER 10, 2023

Capital One: Insights on building a data strategy to drive business value. One hotly debated challenge that many companies struggle with is building an agile data strategy to drive business value. Evidently: ML system design - 300 case studies to learn from. An amazing compilation of ML system design articles from various companies.

Data Engineering

Data Engineering Data Engineer Engineering Metadata

A Look At The Data Systems Behind The Gameplay For League Of Legends

Data Engineering Podcast

NOVEMBER 20, 2022

Summary: The majority of blog posts and presentations about data engineering and analytics assume that the consumers of those efforts are internal business users accessing an environment controlled by the business. Atlan is the metadata hub for your data ecosystem.

Systems

Systems Metadata Data Pipeline MongoDB

Rebuilding Netflix Video Processing Pipeline with Microservices

Netflix Tech

JANUARY 10, 2024

This introductory blog focuses on an overview of our journey. Future blogs will provide deeper dives into each service, sharing insights and lessons learned from this process. A comprehensive list of benefits offered by Cosmos can be found in the linked blog. divide the input video into small chunks 2.

Process

Process Pipeline-centric Media Metadata

Building Netflix’s Distributed Tracing Infrastructure

Netflix Tech

OCTOBER 19, 2020

In our previous blog post we introduced Edgar, our troubleshooting tool for streaming sessions. Now let’s look at how we designed the tracing infrastructure that powers Edgar. We could also get contextual information about the streaming session by joining relevant traces with account metadata and service logs.

Building

Building Transportation Metadata Java

A Reflection On Data Observability As It Reaches Broader Adoption

Data Engineering Podcast

SEPTEMBER 4, 2022

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. A data observability platform?

IT

IT Metadata MongoDB MySQL

Unlocking The Potential Of Streaming Data Applications Without The Operational Headache At Grainite

Data Engineering Podcast

MARCH 25, 2023

The peril is that building a robust and scalable streaming architecture is always more complicated and error-prone than you think it's going to be. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. We feel your pain. It ends up being anything but that.

MySQL

MySQL Python Architecture Machine Learning

Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and…

Netflix Tech

MARCH 25, 2019

Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability and Efficiency. By Di Lin, Girish Lingappa, and Jitender Aswani. Imagine yourself in the role of a data-inspired decision maker staring at a metric on a dashboard, about to make a critical business decision but pausing to ask a question: “Can

Building

Building Metadata Transportation Data Ingestion

Build an end to end JSON logging system for clients apps

Pinterest Engineering

JANUARY 10, 2023

With these in mind, the following key design decisions were made: The logging service endpoint will handle log validation, parsing, and processing. To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. Nothing else is required.

Systems

Systems Building Software Engineer Software Engineering

Data Reprocessing Pipeline in Asset Management Platform @Netflix

Netflix Tech

MARCH 10, 2023

This platform has evolved from supporting studio applications to supporting data science applications and machine-learning applications that discover asset metadata and build various data facts. During this evolution, we quite often receive requests to update existing asset metadata or to add new metadata for newly added features.

Management

Management Kafka Metadata Media

Taking Charge of Tables: Introducing OpenHouse for Big Data Management

LinkedIn Engineering

JULY 19, 2023

Co-Authors: Sumedh Sakdeo, Lei Sun, Sushant Raikar, Stanislav Pak, and Abhishek Nath. Introduction: At LinkedIn, we build and operate an open source data lakehouse deployment to power Analytics and Machine Learning workloads. While functional, our current setup for managing tables is fragmented.

Big Data

Big Data Data Management Management Metadata

How LinkedIn Adopted A GraphQL Architecture for Product Development

LinkedIn Engineering

APRIL 25, 2023

In our previous blog post on GraphQL, we explained how LinkedIn uses GraphQL to expedite the process of onboarding new use-cases for external API partners. In this blog post, we will cover how the GraphQL layer is architected for use by our internal engineers to build member and customer facing applications.

Architecture

Architecture Metadata Java Transportation

Building and maintaining the skills taxonomy that powers LinkedIn's Skills Graph

LinkedIn Engineering

MARCH 21, 2023

One of the most exciting parts of our work is that we get to play a part in helping progress a skills-first labor market through our team’s ongoing engineering work in building our Skills Graph. [Figure 6: Sample Seed Skills Graph] KGBert helps build a more accurate and complex taxonomy in less time.

Building

Building Recruitment Machine Learning Deep Learning

How Rockset Built Vector Search for Scale in the Cloud

Rockset

NOVEMBER 7, 2023

As a result, indexing of newly ingested vectors and metadata does not negatively impact search performance. In this blog, we’ll dig into how Rockset has fully integrated vector search into its search and analytics database. Users can continuously stream and index vectors fully isolated from search. Each cell is defined by a centroid.

Cloud

Cloud Metadata Database SQL

Long Live Data Products! Understand the 4 Stages of the Data Product Lifecycle

Snowflake

AUGUST 22, 2023

The data product lifecycle includes the following stages: Discovery, Design, Development, and Deployment. Let’s take a look at what they entail and the roles that lead and support each of them: Discovery starts the process. In an earlier blog post, I discussed how the notion of a product emerges from a business need.

Metadata

Metadata Data AWS Business Analyst

How the EU’s Digital Operations Resilience Act (DORA) Aims To Strengthen Operational Resilience in Financial Services

Snowflake

APRIL 29, 2024

DORA: Building a More Secure Financial System. DORA, enacted in January 2023, moves beyond reactive measures, requiring FEs and their service providers to proactively identify vulnerabilities, prevent disruptions and plan for swift recovery from incidents. ESA will designate critical ICT providers in January 2025.

Transportation

Transportation Data Governance Government Consulting

Understanding GraphQL Directives: Practical Use-Cases at Zalando

Zalando Engineering

OCTOBER 18, 2023

The query directives are generally useful for clients to express certain types of metadata for the query. We have a separate blog post explaining the details of how Zalando uses persisted queries and how we think about schema stability and granular control. We will discuss this in detail in the next section. are some entities.

Metadata

Metadata Coding Banking Designing

Data Engineering Weekly #162

Data Engineering Weekly

MARCH 10, 2024

Google: Croissant, a metadata format for ML-ready datasets. Google Research introduced Croissant, a new metadata format designed to make datasets ML-ready by standardizing the format, facilitating easier use in machine learning projects. Data engineers build the systems that store and process sensitive information.

Data Engineering

Data Engineering Data Engineer Engineering Datasets

Software Identifiers through the eyes of Nix

Tweag

MARCH 11, 2024

But CISA encouraged me to publish the answer as a separate blog post. It achieves very high levels of reproducibility and provenance tracking of software artifacts by design, utilizing a functional and declarative language to describe software builds and their dependencies. More details on this can be found here or here.

Metadata

Metadata Utilities Building Designing

Unleashing the Power of CDC With Snowflake

Workfall

JUNE 12, 2023

In this blog, we will cover: What Is CDC and Its Benefits? Types of CDC. Audit Columns: This method involves using designated columns within tables to track incremental changes. These additional columns store metadata like timestamps, user IDs, and change types, ensuring granular change tracking and auditability.

Telecommunication

Telecommunication Metadata Healthcare Finance
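
The audit-column method described in the Workfall excerpt above can be sketched in a few lines. This is a minimal in-memory illustration, not Snowflake's or Workfall's implementation: the column names `updated_at`, `updated_by`, and `change_type`, the sample rows, and both helper functions are assumptions chosen for the example.

```python
from datetime import datetime, timezone

# Hypothetical source table. `updated_at`, `updated_by`, and `change_type`
# play the role of the audit columns (timestamp, user ID, change type).
ROWS = [
    {"id": 1, "updated_at": datetime(2023, 6, 1, tzinfo=timezone.utc),
     "updated_by": "etl", "change_type": "INSERT"},
    {"id": 2, "updated_at": datetime(2023, 6, 10, tzinfo=timezone.utc),
     "updated_by": "alice", "change_type": "UPDATE"},
    {"id": 3, "updated_at": datetime(2023, 6, 7, tzinfo=timezone.utc),
     "updated_by": "bob", "change_type": "DELETE"},
]


def changed_since(rows, watermark):
    """Return rows modified strictly after `watermark`, oldest change first."""
    changed = (r for r in rows if r["updated_at"] > watermark)
    return sorted(changed, key=lambda r: r["updated_at"])


def next_watermark(changes, watermark):
    """Advance the watermark to the newest change seen, or keep it if none."""
    return max((r["updated_at"] for r in changes), default=watermark)
```

Each incremental run reads only rows newer than the stored watermark and then advances it, so unchanged rows are never rescanned. One known limitation of this approach is that hard deletes are invisible unless the source records them as rows with a delete marker, as the sample data assumes.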

6 Hard Problems Scaling Vector Search

Rockset

AUGUST 27, 2023

This blog attempts to arm you with some knowledge of your future, the problems you will face, and questions you may not know yet that you need to ask. A proper survey of these approaches would fill many blog posts of this size. If it’s soon, vector latency is a major design point in these systems.

Metadata

Metadata Database Algorithm SQL

Ready-to-go sample data pipelines with Dataflow

Netflix Tech

DECEMBER 3, 2022

All the above commands are very likely to be described in separate future blog posts, but right now let’s focus on the dataflow sample command. This is one way to build trust with our internal user base. This logic consists of the following parts: DDL code, table metadata information, data transformation and a few audit steps.

Data Pipeline

Data Pipeline Scala Metadata Food

Unifying Iceberg Tables on Snowflake

Snowflake

AUGUST 31, 2023

Catalog Integration: Our newly developed Catalog Integration feature allows you to seamlessly plug Snowflake into other Iceberg catalogs tracking table metadata. In this blog post, we’ll dive into the details of these features and the benefits for customers. In addition to Iceberg External Tables, we introduced Native Iceberg Tables.

Metadata

Metadata AWS Data Lake Datasets

Announcing New Innovations for Data Warehouse, Data Lake, and Data Lakehouse in the Data Cloud

Snowflake

NOVEMBER 2, 2023

Over the years, the technology landscape for data management has given rise to various architecture patterns, each thoughtfully designed to cater to specific use cases and requirements. Instead, we strive to help customers by providing a platform to build architectures based on what works in their organization, even if that changes over time.

Data Lake

Data Lake Data Warehouse Cloud Unstructured Data

How to learn data engineering

Christophe Blefari

JANUARY 20, 2024

The idea behind it is to solve data problems by building software. Read technical blogs, watch conferences and read 📘 Designing Data-Intensive Applications (even if it could be overkill). Every company out there has its own definition for the data engineer role. My advice on this point is to learn from others.

Data Engineering

Data Engineering Data Engineer Engineering Hadoop

Real World Change Data Capture At Datacoral

Data Engineering Podcast

MARCH 22, 2021

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. How do you handle observability of CDC flows?

Data Warehouse

Data Warehouse Metadata Data Lake Hadoop

Data Engineering Weekly #159

Data Engineering Weekly

FEBRUARY 18, 2024

Modern data stack vendors chose speed, and never attempted to truly build something together. Mikkel Dengsøe: Data ownership - A practical guide. Data Ownership is the fundamental construct to build reliable data engineering practices. I believe the data ownership problem is much deeper than simple metadata management.

Data Engineering

Data Engineering Data Engineer Engineering Data

How to Make Your Own Google Chrome Extension?

Workfall

AUGUST 23, 2023

These extensions are designed to enhance and customize the browsing experience by adding new features, modifying existing ones, or integrating with other web services. These components will serve as the building blocks of our extension’s functionality. We will come up with more such use cases in our upcoming blogs.

Metadata

Metadata Coding Project Utilities

Running Unified PubSub Client in Production at Pinterest

Pinterest Engineering

NOVEMBER 7, 2023

For these reasons, and others detailed in our original PubSub Client blog post, our team has decided to invest in building, productionalizing, and most recently open-sourcing PubSub Client (PSC). years since our previous blog post, PSC has been battle-tested at large scale in Pinterest with notably positive feedback and results.

Kafka

Kafka Java Software Engineer Software Engineering

Highest Paying Data Science Jobs in the World

Knowledge Hut

MAY 9, 2024

In this blog post, we will look at some of the world's highest paying data science jobs, what they entail, and what skills and experience you need to land them. Responsibilities Data architects assess an organization's data sources and design plans for centralized data management. What is Data Science?

Data Science

Data Science Data Mining Data Architect Programming Language

Scalable Annotation Service - Marken

Netflix Tech

JANUARY 25, 2023

For example, we have a service that stores a movie entity’s metadata or a service that stores metadata about images. In Marken, an annotation is a piece of metadata which can be attached to an object from any domain. Actual data is stored in the metadata section of json. In this case it is BOUNDING_BOX.

Algorithm

Algorithm Media Metadata Data Ingestion

Incremental Processing using Netflix Maestro and Apache Iceberg

Netflix Tech

NOVEMBER 20, 2023

In this blog post, we talk about the landscape and the challenges in workflows at Netflix. We will show how we are building a clean and efficient incremental processing solution (IPS) by using Netflix Maestro and Apache Iceberg. Foreach pattern: Users build backfill workflows using Maestro foreach support. INSERT OVERWRITE).

Process

Process Data Pipeline Datasets SQL

Using Metrics Layer to Standardize and Scale Experimentation at DoorDash

DoorDash Engineering

APRIL 12, 2023

Building a metrics layer that works for experimentation is not simple, as it should support different types of metrics of varying scale that are used across the diverse range of A/B tests that are being run across different products. We will also dive deep into our design and implementation processes and the lessons we learnt.

SQL

SQL Metadata Raw Data Government

Data Architect: Role Description, Skills, Certifications and When to Hire

AltexSoft

FEBRUARY 11, 2023

A data architect is an IT professional responsible for the design, implementation, and maintenance of the data infrastructure inside an organization. Data architecture is the organization and design of how data is collected, transformed, integrated, stored, and used by a company. What is a data architect?

Data Architect

Data Architect Certification Generalist Big Data
