Cloudera DataFlow Designer: The Key to Agile Data Pipeline Development

Cloudera

We just announced the general availability of Cloudera DataFlow Designer, bringing self-service data flow development to all CDP Public Cloud customers. In our previous DataFlow Designer blog post, we introduced you to the new user interface and highlighted its key capabilities.

1. Streamlining Membership Data Engineering at Netflix with Psyberg

Netflix Tech

In this context, managing the data, especially when it arrives late, can present a substantial challenge! In this three-part blog post series, we introduce you to Psyberg, our incremental data processing framework designed to tackle such challenges. Let’s dive in!
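The series goes on to explain Psyberg's internals; as a very rough sketch of the general idea of incremental processing with late-arriving data (illustrative names only, not Netflix's actual code), a watermark-based run might look like this:

    from datetime import datetime, timezone

    def incremental_run(events, last_processed_at):
        """Pick up everything landed since the previous run, then recompute
        every logical date those events belong to -- so a late event for an
        old date still triggers reprocessing of that date's partition."""
        new_events = [e for e in events if e["landed_at"] > last_processed_at]
        affected_dates = {e["event_date"] for e in new_events}
        for d in sorted(affected_dates):
            partition = [e for e in events if e["event_date"] == d]
            process_partition(d, partition)   # recompute the whole partition
        return datetime.now(timezone.utc)     # new watermark for the next run

    def process_partition(event_date, rows):
        print(f"reprocessing {event_date}: {len(rows)} rows")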

Data Reprocessing Pipeline in Asset Management Platform @Netflix

Netflix Tech

This platform has evolved from supporting studio applications to data science and machine-learning applications that discover asset metadata and build various data facts. During this evolution, we have often received requests to update the existing assets’ metadata or to add new metadata for newly added features.
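As a rough illustration of that kind of metadata reprocessing (the asset store and extractor below are hypothetical stand-ins, not the Netflix platform's API), a backfill over existing assets might look like:

    def backfill_metadata(assets, extractors, version):
        """Run every registered extractor against each existing asset and
        merge the result into its metadata, stamping the reprocessing version."""
        for asset in assets:
            for name, extract in extractors.items():
                asset["metadata"][name] = extract(asset)
            asset["metadata"]["reprocessed_version"] = version
        return assets

    # hypothetical asset record and extractor
    assets = [{"id": "a1", "path": "s3://bucket/a1.mov", "metadata": {}}]
    extractors = {"duration_s": lambda a: 123.0}
    backfill_metadata(assets, extractors, version=2)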

Data Lineage Tools: Key Capabilities and 5 Notable Solutions

Databand.ai

This capability is particularly useful in complex data landscapes, where data may pass through multiple systems and transformations before reaching its final destination. Impact analysis: When changes are made to data sources or data processing systems, it’s critical to understand the potential impact on downstream processes and reports.
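Impact analysis boils down to walking the lineage graph downstream from the changed asset. A minimal sketch, assuming a simple adjacency-list graph rather than any particular lineage tool:

    from collections import deque

    def downstream_impact(lineage, changed):
        """Breadth-first walk of the lineage graph from the changed asset,
        returning every downstream dataset, report, or model it feeds."""
        impacted, queue = set(), deque([changed])
        while queue:
            node = queue.popleft()
            for child in lineage.get(node, []):
                if child not in impacted:
                    impacted.add(child)
                    queue.append(child)
        return impacted

    # made-up example graph: edges point from upstream asset to consumer
    lineage = {
        "orders_raw": ["orders_clean"],
        "orders_clean": ["revenue_report", "churn_features"],
        "churn_features": ["churn_model"],
    }
    print(downstream_impact(lineage, "orders_raw"))
    # {'orders_clean', 'revenue_report', 'churn_features', 'churn_model'}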

An Engineering Guide to Data Quality - A Data Contract Perspective - Part 2

Data Engineering Weekly

In the first part of this series, we talked about design patterns for data creation and the pros & cons of each system from the data contract perspective. In the second part, we will focus on architectural patterns to implement data quality from a data contract perspective. Why is Data Quality Expensive?
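One common pattern in this space is validating producer output against the contract before it is published downstream. A minimal sketch, assuming a hand-rolled contract definition rather than any specific tooling:

    contract = {
        "order_id": int,
        "amount": float,
        "currency": str,
    }

    def violations(record, contract):
        """Return a list of contract violations for one record."""
        errors = []
        for field, expected in contract.items():
            if field not in record:
                errors.append(f"missing field: {field}")
            elif not isinstance(record[field], expected):
                errors.append(f"{field}: expected {expected.__name__}, "
                              f"got {type(record[field]).__name__}")
        return errors

    print(violations({"order_id": 7, "amount": "12.5"}, contract))
    # ['amount: expected float, got str', 'missing field: currency']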

How to learn data engineering

Christophe Blefari

Some years ago he wrote three articles defining the data engineering field. Some concepts: when doing data engineering you can touch a lot of different concepts. Read technical blogs, watch conference talks and read 📘 Designing Data-Intensive Applications (even if it could be overkill). Is it really modern?

DataOps Architecture: 5 Key Components and How to Get Started

Databand.ai

Slow data processing: Due to the manual nature of many data workflows in legacy architectures, data processing can be time-consuming and resource-intensive. Data sources: These include the various databases, applications, APIs, and external systems from which data is collected and ingested.