Blog and Engineering - Data Engineering Digest

The fancy data stack—batch version

Christophe Blefari

AUGUST 4, 2023

As a disclaimer, this may not quite make sense in a corporate context, but since this is my blog, I'll do what I want. A few requirements: the source data lies in a Postgres database, in flat CSV files, and in Google Sheets.

Google Cloud

Google Cloud MongoDB NoSQL Data

GPT-based data engineering accelerators

RandomTrees

FEBRUARY 2, 2024

GPT-based data engineering accelerators make working with data more accessible. These accelerators combine information from different sources. DataGPT: OpenAI developed DataGPT for performing data engineering tasks. Genie: Genie is open source and flexible, and is used to create custom data engineering pipelines.

Data Engineering

Data Engineering Data Engineer Engineering Data Pipeline

Data Engineering Weekly #124

Data Engineering Weekly

MARCH 26, 2023

Now you can win $1,000 cash by contributing a Transformation to our open-source library. Data Engineering Weekly readers get a 20% discount by applying Promo Code: DataWeekly20. Data Council website: [link]. The Real-Time Analytic Summit is on April 25-26 in downtown San Francisco, CA.

Data Engineering

Data Engineering Data Engineer Engineering Lambda Architecture

#ClouderaLife Spotlight: Amogh Desai, Software Engineer II

Cloudera

FEBRUARY 15, 2023

This month’s #ClouderaLife Spotlight features software engineer Amogh Desai. Snatching victory from the jaws of defeat: Amogh and his fellow hackathon team members felt the rush of victory after winning Cloudera’s 2022 global hackathon in the product development category. One way he does this is through blog writing.

Software Engineer

Software Engineer Software Engineering Engineering Recruitment

Snowpark ML: The ‘Easy Button’ for Open Source LLM Deployment in Snowflake

Snowflake

SEPTEMBER 5, 2023

Open source generative models such as Meta’s Llama 2 are pivotal in making that possible. Starting from your data in Snowflake, you can quickly spin up a powerful open source LLM (in this case, Llama 2) within Snowflake, securely access your data, and accomplish this workflow in minutes. Let’s see how.

Medical

Medical Python Government Datasets

Big Data Technologies that Everyone Should Know in 2024

Knowledge Hut

APRIL 25, 2024

In this blog post, we will discuss such technologies. Hadoop is an open-source framework that enables distributed processing of large data sets across clusters of commodity servers. Big data technologies can be categorized into four broad categories: batch processing, streaming, NoSQL databases, and data warehouses.

Big Data

Big Data Technology NoSQL Hadoop

Data Engineering Weekly #132

Data Engineering Weekly

MAY 28, 2023

Data Engineering Weekly Is Brought to You by RudderStack. RudderStack provides data pipelines that make collecting data from every application, website, and SaaS platform easy, then activating it in your warehouse and business tools. If you want to write a career guidance series for Data Engineering Weekly, please DM me on LinkedIn.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

How to Become a Data Engineer in 2024?

Knowledge Hut

DECEMBER 26, 2023

Data Engineering is typically a software engineering role that focuses deeply on data – namely, data workflows, data pipelines, and the ETL (Extract, Transform, Load) process. According to reports by DICE Insights, the job of a Data Engineer is considered the top job in the technology industry in the third quarter of 2020.

Data Engineering

Data Engineering Data Engineer Engineering Pipeline-centric

Data Engineering Weekly #109

Data Engineering Weekly

NOVEMBER 27, 2022

Data Engineering Weekly Is Brought to You by RudderStack. RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. I have a long list of thoughts on this conversation, which might need a blog post on its own.

Data Engineering

Data Engineering Data Engineer Engineering SQL

Data News — Week 23.13

Christophe Blefari

MARCH 31, 2023

We are slowly approaching the two-year anniversary of the blog and the newsletter. To be honest, time flies and I’d have preferred to do more for the blog at the start of the year, but my freelancing activities and my laziness took up so much of my time. Now you need to do multiplication and open and read 5 pages to understand the pricing.

Bytes

Bytes Data Google Cloud Education

Data Engineering Weekly #120

Data Engineering Weekly

FEBRUARY 26, 2023

Data Engineering Weekly Is Brought to You by RudderStack. RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Data Contract platforms are the same, so this space is wide open, waiting for disruption.

Data Engineering

Data Engineering Data Engineer Engineering Raw Data

Data Vault on Snowflake: Feature Engineering and Business Vault

Snowflake

MARCH 30, 2023

Collecting, cleaning, and organizing data into a coherent form for business users to consume are all standard data modeling and data engineering tasks for loading a data warehouse. Feature engineering: Data is transformed to support ML model training. “The features you use influence more than everything else the result.

Engineering

Engineering Raw Data Data Science Scala

Data Engineer Salary in USA: How Much Can You Make in 2023?

Knowledge Hut

FEBRUARY 16, 2023

Demand for data engineers is at a peak globally today because of the massive amount of data that companies accumulate and work with to draw actionable insights and make better business decisions. That's where the data engineer comes into the picture, making it a demanding profession today. What Does a Data Engineer Do?

Data Engineering

Data Engineering Data Engineer Engineering Healthcare

Meta contributes new features to Python 3.12

Engineering at Meta

OCTOBER 5, 2023

Open source at Meta is an important part of how we work and share our learnings with the community. For several years, we have been sharing our work on Python and CPython through our open source Python runtime, Cinder.

Python

Python Programming Language Coding Programming

Mastering AI-Powered Product Development: Introducing Promptimize for Test-Driven Prompt…

Maxime Beauchemin

APRIL 26, 2023

Originally posted here: [link]. AI, AGI, LLM, and GPT are the buzzwords of the moment. Although we’re still figuring out new patterns, we know that prompt engineering is a crucial piece of the puzzle. What a stellar assistant!

SQL

SQL Database Engineering Software Engineer

Data News — Snowflake and Databricks summits

Christophe Blefari

JULY 3, 2023

💡 If you want another view on both conferences, Ananth from Data Engineering Weekly wrote about the conferences extravaganza and a few trends he wanted to chat about. Databricks acquires MosaicML for $1.3b: it should land in the data economy category, but you know. Also, you might know Mode through Benn Stancil's blog.

SQL

SQL Data Kafka AWS

All of Netflix’s HDR video streaming is now dynamically optimized

Netflix Tech

NOVEMBER 29, 2023

As noted in an earlier blog post, we began developing an HDR variant of VMAF; let’s call it HDR-VMAF. Yes, we are committed to supporting the open-source community. Improvements have been seen across all device categories ranging from TVs to mobiles and tablets.

Metadata

Metadata Electronics Algorithm Software Engineer

Taking Charge of Tables: Introducing OpenHouse for Big Data Management

LinkedIn Engineering

JULY 19, 2023

Co-Authors: Sumedh Sakdeo, Lei Sun, Sushant Raikar, Stanislav Pak, and Abhishek Nath. Introduction: At LinkedIn, we build and operate an open source data lakehouse deployment to power Analytics and Machine Learning workloads. Unfortunately, there is currently no system in open source that unifies them through a single control plane.

Big Data

Big Data Data Management Management Metadata

Revolutionizing Real-Time Streaming Processing: 4 Trillion Events Daily at LinkedIn

LinkedIn Engineering

OCTOBER 19, 2023

In this case study, LinkedIn's Bingfeng Xia, Engineering Manager, and Xinyu Liu, Senior Staff Engineer, shed light on how the Apache Beam programming model's unified, portable, and user-friendly data processing framework has enabled a multitude of sophisticated use cases and revolutionized streaming processing at LinkedIn.

Process

Process Lambda Architecture Kafka Machine Learning

How DoorDash Standardized and Improved Microservices Caching

DoorDash Engineering

OCTOBER 19, 2023

Each team manages their own data and exposes access through gRPC services, an open-source remote procedure call framework used to build scalable APIs. Problems: Cache staleness: While implementing caching for a method is straightforward, it’s challenging to ensure that the cache remains updated with the original data source.

Database

Database Coding Java Accessible

7 Stunning React JS Projects for Beginners in 2023

Knowledge Hut

OCTOBER 26, 2023

Web development is the category of software development that involves the design, development, and maintenance of a full website or mobile website for the Internet. Web developers are software engineers who create and maintain websites. Source Code: To-Do List - GitHub 2. Source Code: Simple Calculator 3.

Project

Project Portfolio Electronics Food

Level Up Your Data Platform With Active Metadata

Data Engineering Podcast

JUNE 19, 2022

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. How does it work?

Metadata

Metadata MongoDB Scala MySQL

Bring Your Own Algorithm to Anomaly Detection

Pinterest Engineering

OCTOBER 17, 2023

Charles Wu | Software Engineer; Isabel Tallam | Software Engineer; Kapil Bajaj | Engineering Manager. Overview: In this blog, we present a pragmatic way of integrating analytics, written in Python, with our distributed anomaly detection platform, written in Java. To explore and apply to open roles, visit our Careers page.

Algorithm

Algorithm Java Python Software Engineer

Automation tool to Convert Informatica Code to Talend

RandomTrees

APRIL 18, 2024

On the other hand, Talend provides a comprehensive suite of open-source tools for data integration, offering similar functionalities with a focus on ease of use and flexibility. Having a clear understanding of the source data structures, transformation logic, and target requirements is crucial.

Coding

Coding Retail Metadata Python

Failure Mitigation for Microservices: An Intro to Aperture

DoorDash Engineering

MARCH 14, 2023

In this post, we evaluate the open-source project Aperture and how it enables a global failure mitigation plan for our services. At DoorDash, we view each failure as a learning opportunity and sometimes share our insights and lessons learned in public blog posts to show our commitment to reliability and knowledge sharing.

Metadata

Metadata Java Database Systems

Generating your shopping list with AI: recommendations at Picnic

Picnic Engineering

JANUARY 9, 2024

Especially if the products are in a category the customer has never bought before. Additionally they are all implemented in the open-source library RecBole which is a really handy package for comparing recommendation models in Python in a structured and reproducible way. Explore categories are easier!

Machine Learning

Machine Learning Datasets Algorithm Systems

Tailored Support Designed for You

Cloudera

MAY 24, 2022

At Cloudera we’re building the world’s only hybrid data platform that’s founded on open source and truly hybrid. Customers that fit into the latter category tend to have complex implementations or run mission-critical workloads on CDP. Release frequency of open-source Apache projects included in CDP.

Designing

Designing Cloud Project Technology

Why Reinvent the Wheel? The Challenges of DIY Open Source Analytics Platforms

Cloudera

JULY 24, 2023

In their effort to reduce their technology spend, some organizations that leverage open source projects for advanced analytics often consider either building and maintaining their own runtime with the required data processing engines or retaining older, now obsolete, versions of legacy Cloudera runtimes (CDH or HDP).

Software Engineer

Software Engineer Software Engineering Project Coding

PinCompute: A Kubernetes Backed General Purpose Compute Platform for Pinterest

Pinterest Engineering

OCTOBER 31, 2023

Harry Zhang, Jiajun Wang, Yi Li, Shunyao Li, Ming Zong, Haniel Martino, Cathy Lu, Quentin Miao, Hao Jiang, James Wen, David Westbrook | Cloud Runtime Team. Overview: Modern compute platforms are foundational to accelerating innovation and running applications more efficiently.

Architecture

Architecture Pipeline-centric Accessible Accessibility

How to Build a 5-Layer Data Stack

Monte Carlo

JULY 19, 2023

The first layer of your stack will generally fall into one of three categories: a data warehouse solution like Snowflake that handles predominantly structured data; a data lake that focuses on larger volumes of unstructured data; and a hybrid solution like Databricks’ Lakehouse that combines elements of both. Image courtesy of Databricks.

Building

Building Business Intelligence Cloud Storage BI

Forge Your Career Path with Best Data Engineering Certifications

ProjectPro

FEBRUARY 21, 2023

With so many data engineering certifications available, choosing the right one can be a daunting task. There are over 133K data engineer job openings in the US, but how will you stand out in such a crowded job market? The answer is by earning professional data engineering certifications! AWS or Azure? Don’t worry!

Certification

Certification Data Engineering Data Engineer Engineering

PyTorch Introduction — Using Custom Data

DareData

FEBRUARY 28, 2024

Since ChatGPT’s release, deep learning libraries have arguably garnered the most attention among data scientists and machine learning engineers, particularly due to the current practical applications they enable. We’ve used a few custom datasets in our examples and previous blog posts. Let’s start!

Datasets

Datasets Deep Learning Data Machine Learning

How to Build a 5-Layer Modern Data Stack (with Example Tools)

Monte Carlo

JANUARY 27, 2024

The first layer of your stack will generally fall into one of three categories: a data warehouse solution like Snowflake that handles predominantly structured data; a data lake that focuses on larger volumes of unstructured data; and a hybrid solution like Databricks’ Lakehouse that combines elements of both. Image courtesy of Databricks.

Building

Building Business Intelligence Cloud Storage BI

Data Labeling in Machine Learning: Process, Types, and Best Practices

Knowledge Hut

JULY 28, 2023

If some of the Machine Learning terminology in this blog seems unfamiliar to you, don’t worry: we have the Best Data Science courses to help you out. These tools are either open source, making them free for everyone to use, or require a paid subscription. It is the most used labeling type.

Machine Learning

Machine Learning Process Datasets Raw Data

Azure Synapse: Unlocking the Power of Your Data

Edureka

MAY 9, 2023

With Azure Synapse, they were able to integrate their data sources into a single platform, providing them with a unified view of their data. In this blog, we’ll explore the unique features and capabilities of Azure Synapse and how businesses can leverage them to drive growth and success. Unique Features of Azure Synapse 1.

BI

BI Retail Datasets Machine Learning

Lyft’s Reinforcement Learning Platform

Lyft Engineering

MARCH 12, 2024

It recommends different news article categories to a user based on the time of day and changing preferences over time. Model: The model’s action space covers four different news article categories: [“politics”, “sports”, “music”, “food”]. Library: We leverage open-source libraries like Vowpal Wabbit and RLlib for modeling.

Algorithm

Algorithm Machine Learning Datasets Food

30+ Free Datasets for Your Data Science Projects in 2023

Knowledge Hut

NOVEMBER 28, 2023

With different sources of data, we can leverage the information to drive good business understanding. Types of Datasets Datasets can be public or private, depending on their source. Data Science Data Sets for Public Data Sources Public data sources can be in various forms. are also examples of datasets. Link to Dataset 2.

Datasets

Datasets Data Science Project Banking

Introducing the dbt_project_evaluator: Automatically evaluate your dbt project for alignment with best practices

dbt Developer Hub

NOVEMBER 29, 2022

Through solving these problems over and over, the Professional Services team began to hone our best practices for working with dbt and how analytics engineers could improve their dbt project. We wrote articles on the Developer Blog (see 1, 2, and 3), gave Coalesce talks, and created training courses. Don’t believe me???

Project

Project Professional Services SQL Coding

Easiest Full Stack Project Ideas To Create Your Portfolio

Knowledge Hut

NOVEMBER 28, 2023

There can be different categories of real-time data, such as the mortality rate due to COVID, population growth, emails sent today, Google searches made in one day, etc. The core idea can be to create a platform that updates statistics in real time by collating data from numerous sources.

Portfolio

Portfolio Project Food Entertainment

Data Engineer vs Data Scientist- The Differences You Must Know

ProjectPro

JUNE 9, 2021

This blog on Data Science vs. Data Engineering presents a detailed comparison between the two domains. What does a Data Engineer do? Are you a Data Scientist or a Data Engineer? Is data engineering more important than data science? Data Engineer vs Data Scientist: Which is better?

Data Engineering

Data Engineering Data Engineer Engineering Amazon Web Services

11 Predictions Data Experts Have for the Year Ahead

Snowflake

JANUARY 26, 2023

Finally, we will see the Open Cybersecurity Schema Framework (OCSF) rise as the vendor-neutral standard for security data.” Business users are no longer patiently waiting for data scientists and ML engineers to unlock the value of data; they want to extract insights from data themselves.

Retail

Retail Healthcare Unstructured Data Media

Upscaling LinkedIn's Profile Datastore While Reducing Costs

LinkedIn Engineering

MAY 9, 2023

In the past, we addressed latency, throughput and cost issues by migrating off Oracle onto Espresso, an open-source document platform, and adding more nodes. The decision also set forth a number of engineering problems because the cache isn’t backed by a primary storage infrastructure.

Database

Database Project Datasets Designing

An A-Z Data Adventure on Cloudera’s Data Platform

Cloudera

DECEMBER 21, 2020

In this blog we will take you through a persona-based data adventure, with short demos attached, to show you the A-Z data worker workflow expedited and made easier through self-service, seamless integration, and cloud-native technologies. A Cloudera Data Engineering service exists. Assumptions. Company data exists in the data lake.

Banking

Banking Data Lake Data Data Warehouse

How To Switch To Data Science From Your Current Career Path?

Knowledge Hut

NOVEMBER 27, 2023

Data cleansing / data scrubbing: dealing with incongruous data, like misspelled categories or missing values. Understand topics like data preprocessing, feature engineering, and model evaluation. Staying updated with the latest trends and technologies in data science: enroll in online courses and webinars, and follow blogs and podcasts.

Data Science

Data Science Datasets Machine Learning Algorithm
