LLM finetuning memory requirements by Alex Birch
Scott Logic
NOVEMBER 23, 2023
- Cost increases when gradient accumulation is enabled, or becomes ~free if used in concert with DDP
- DDP usually costs ~4 bytes/param, but becomes cheaper if used in concert with AMP
- DDP can be made 2.5
- Transformer Math does not mention a "4 bytes/param master gradients" cost.
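The notes above reason in bytes/param. As a minimal sketch of that accounting, here is a small calculator for per-parameter memory under mixed-precision (AMP) Adam finetuning; the byte counts are illustrative assumptions (half-precision weights and gradients, fp32 master weights, fp32 Adam moments), and activations and DDP gradient buckets are deliberately excluded.

```python
# Rough per-parameter memory estimate for mixed-precision Adam finetuning.
# All byte counts below are assumptions for illustration, not measured values.

def bytes_per_param(
    weight_bytes: int = 2,         # fp16/bf16 model weights
    grad_bytes: int = 2,           # fp16/bf16 gradients
    master_weight_bytes: int = 4,  # fp32 master copy kept by the optimizer
    adam_state_bytes: int = 8,     # fp32 momentum + variance (4 bytes each)
) -> int:
    """Sum of per-parameter costs, excluding activations and DDP buckets."""
    return weight_bytes + grad_bytes + master_weight_bytes + adam_state_bytes


def total_gib(n_params: float, per_param: int) -> float:
    """Total cost in GiB for a model with n_params parameters."""
    return n_params * per_param / 2**30


per_param = bytes_per_param()  # 16 bytes/param under these assumptions
print(per_param)
print(round(total_gib(7e9, per_param), 1))  # a hypothetical 7B-param model
```

Varying the keyword arguments reproduces the trade-offs the notes describe, e.g. adding ~4 bytes/param for DDP gradient buckets, or dropping the separate gradient copy when accumulation reuses a buffer.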