Top Data Engineering Digest Data Engineer Data Engineering Content for Fri.Feb 03, 2023

Getting Started with The Basics of Docker

Analytics Vidhya

FEBRUARY 3, 2023

Introduction: “Let’s containerize your code to ship worldwide!” If you read the above quote, you might be wondering what this all means. Well, my friend, this is what Docker is. Let me explain it with an example. Say Harish and Lisa are two people working on the same project but on two different systems (say Windows and […]

Coding

Coding Project Systems IT

Table file formats - Change Data Capture: Delta Lake

Waitingforcode

FEBRUARY 3, 2023

It's time to start the 4th part of the Table file formats series. This time the topic is Change Data Capture, that is, how to stream all the changes made to a table. As in the 3rd part, I'm going to start with Delta Lake.

Data

Data IT

Trending Sources

How to Build and Monitor Systems Using Airflow?

Analytics Vidhya

FEBRUARY 3, 2023

Introduction: Do you find yourself spending too much time managing your machine-learning tasks? Are you looking for a way to automate and simplify the process? Airflow can help you manage your workflow and make your life easier with its monitoring and notification features. Imagine scheduling your ML tasks to run automatically without the need for manual […]

Systems

Systems Building Machine Learning Management

Data News — Week 23.05

Christophe Blefari

FEBRUARY 3, 2023

Hey you, it's already February. Every week brings the same analysis for me: I plan too many tasks but deliver slowly. I guess that's how it is. Still, I love this Friday rendezvous that we have together. I'm still amazed by how I changed my old habits to add writing to my workflow. And it brings me a lot of joy.

BI Google Cloud SQL Machine Learning

Get Better Network Graphs & Save Analysts Time

Many organizations today are unlocking the power of their data by using graph databases to feed downstream analytics, enhance visualizations, and more. Yet, when different graph nodes represent the same entity, graphs get messy. Watch this essential video with Senzing CEO Jeff Jonas on how adding entity resolution to a graph database condenses network graphs to improve analytics and save your analysts time.

Database

YARN or Kubernetes for Apache Spark?

Waitingforcode

FEBRUARY 3, 2023

I wrote my first blog post about Apache Spark on Kubernetes in 2018, trying to answer the question: what can Kubernetes bring to Apache Spark? Four years later, this resource manager is a mature Spark component, but a new question has arisen in my head. Should I stay on YARN or switch to Kubernetes?

Management

How to Implement a Federated Learning Project with Healthcare Data

KDnuggets

FEBRUARY 3, 2023

Learn about Federated Learning and how you can use it in the healthcare sector.

Healthcare

Healthcare Project Data IT

What's new on the cloud for data engineers - part 7 (05-08.2022)

Waitingforcode

FEBRUARY 3, 2023

Four months in cloud history is a huge period of time. Even when 2 of the 4 months are the usual "holiday" months. As you can guess from the title, it's time to see what changed recently on the cloud from a data engineering perspective!

Data Engineering

Data Engineering Data Engineer Cloud Engineering

More Trending

AI / ML Survival Guide: Conquer DataOps and Data Composability Challenges and Transform into a Truly Data-Driven Organization

The Modern Data Company

FEBRUARY 3, 2023

Get to the Future Faster – Modernize Your Manufacturing Data Architecture Without Ripping and Replacing. Implementing customer lifetime value as a mission-critical KPI has many challenges. Companies need consistent, high-quality data and a straightforward way to measure CLV. In the past, organizations have struggled to implement CLV as a practical, value-generating metric, but a new data solution could help.

Manufacturing

Manufacturing High Quality Data Data Architecture Architecture

Predicate pushdown, why it doesn't work every time?

Waitingforcode

FEBRUARY 3, 2023

Pushdowns in Apache Spark are a great way to delegate some operations to the data sources and so reduce the volume of data the job has to process. However, there is one important gotcha. Watch out for the definition of your predicate because from time to time, even though the pushdown predicate is supported by the data source, the predicate can still be executed by the Apache Spark job!

IT Process Data

The Future of Retail: Key Challenges and Opportunities

The Modern Data Company

FEBRUARY 3, 2023

Retail

Retail Manufacturing High Quality Data Data Architecture

Table formats - reading: Delta Lake

Waitingforcode

FEBRUARY 3, 2023

In the previous blog post about Delta Lake you discovered the logic of the writing part. In the meantime, Delta Lake 2 was released, and it's on this brand new version that I'm going to base some findings related to data reading.

IT Data

Understanding User Needs and Satisfying Them

Speaker: Scott Sehlhorst

We know we want to create products which our customers find to be valuable. Whether we label it as customer-centric or product-led depends on how long we've been doing product management. There are three challenges we face when doing this. The obvious challenge is figuring out what our users need; the non-obvious challenges are in creating a shared understanding of those needs and in sensing if what we're doing is meeting those needs.

Certification

AI is Not Here to Replace Us

KDnuggets

FEBRUARY 3, 2023

Is the fear of AI replacing humans justified? Here we have a look at what AI is good for and what it isn’t.

Observable metrics

Waitingforcode

FEBRUARY 3, 2023

Observability is a hot topic nowadays, not only in the data industry but also in the software industry. Apache Spark innovates a lot in this field, with new metrics for Structured Streaming and an important update added in the 3.0.0 release that I missed at the time: observable metrics.

Data

Data Integration Strategies for Time Series Databases

Towards Data Science

FEBRUARY 3, 2023

Exploring popular data integration strategies for TSDBs, including ETL, ELT, and CDC.

Data Integration

Data Integration Database Data Science Data

PySpark and vectorized User-Defined Functions

Waitingforcode

FEBRUARY 3, 2023

The Scala API of Apache Spark SQL has various ways of transforming the data, from the native and User-Defined Function column-based functions, to more custom and row-level map functions. PySpark doesn't have this mapping feature but does have the User-Defined Functions with an optimized version called vectorized UDF!

Scala

Scala SQL Data

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Speaker: Timothy Chan, PhD., Head of Data Science

Are you ready to move beyond the basics and take a deep dive into the cutting-edge techniques that are reshaping the landscape of experimentation? 🌐 From Sequential Testing to Multi-Armed Bandits, Switchback Experiments to Stratified Sampling, Timothy Chan, Data Science Lead, is here to unravel the mysteries of these powerful methodologies that are revolutionizing how we approach testing.

Data Science

ChatGPT for Beginners

KDnuggets

FEBRUARY 3, 2023

List of best crash courses for ChatGPT.

Process

Table file formats - reading path: Apache Hudi

Waitingforcode

FEBRUARY 3, 2023

After Delta Lake and Apache Iceberg it's time to see the reading part of Apache Hudi. Despite an apparent similarity with the aforementioned table formats, Apache Hudi has an interesting reading specificity related to the different table types.

Certification Courses in Operations Management: How to choose one.

Edureka

FEBRUARY 3, 2023

Earning a specialisation certificate is very important to achieve your career goals. Everyone has both personal and professional objectives in life. In most cases, reaching your personal goals depends greatly on how well you perform professionally. It means that you must get a good job or advance well in your job to reach your life goals. Certain professions are more lucrative than others, and operations management is one such area.

Certification

Certification Management Manufacturing Hospitality

Wildcard path and partitions

Waitingforcode

FEBRUARY 3, 2023

Let's suppose you store the partitioned data under the /data/mydir location. What will be the difference if you read this directory with Apache Spark as /data/mydir/ and /data/mydir/* ? You should find the answer to the question just below.
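As a rough illustration of why the two paths can behave differently, here is a plain-Python sketch, not Spark code. It assumes that partition columns are inferred from `key=value` directory names found below the base path of the read, and that a wildcard makes the matched partition directory the base path. The helper `partition_columns`, the `date=2023-02-03` directory, and the file name are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict


def partition_columns(file_path: str, base_path: str) -> Dict[str, str]:
    """Collect key=value directory names found between base_path and the file."""
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(base_path))
    columns: Dict[str, str] = {}
    for directory in relative.parts[:-1]:  # skip the file name itself
        if "=" in directory:
            key, value = directory.split("=", 1)
            columns[key] = value
    return columns


# Hypothetical data file stored under the partitioned location.
data_file = "/data/mydir/date=2023-02-03/part-0000.parquet"

# Reading /data/mydir/ : discovery starts at the table root,
# so the date=... directory is seen as a partition column.
from_root = partition_columns(data_file, "/data/mydir")

# Reading /data/mydir/* : the wildcard already resolves to the partition
# directory, so discovery starts one level lower and finds nothing (assumption).
from_wildcard = partition_columns(data_file, "/data/mydir/date=2023-02-03")
```

Under these assumptions, `from_root` contains the `date` column while `from_wildcard` is empty. In Spark itself, the `basePath` read option is the documented way to tell partition discovery where to start.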

Data

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Speaker: Anne Steiner and David Laribee

As a concept, Developer Experience (DX) has gained significant attention in the tech industry. It emphasizes engineers’ efficiency and satisfaction during the product development process. As product managers, we need to understand how a good DX can contribute not only to the well-being of our development teams but also to the broader objectives of product success and customer satisfaction.

Engineering

Google Analytics to Azure: 2 Fool-proof Ways to Replicate Your Data

Hevo

FEBRUARY 3, 2023

“Torture the data, and it will confess to anything.” – Ronald Coase, Nobel Prize laureate. This quote may apply only to the field of data analytics, but it is very powerful, don't you agree? It’s pretty relevant when extracting data from Google Analytics to Azure, because doing so is very tricky.

Data Analytics

Data Analytics Data IT

Apache Spark listeners

Waitingforcode

FEBRUARY 3, 2023

A message bus is a common architectural design among enterprise design patterns. But it's also present at a lower level to enable event-driven behavior. Apache Spark is no exception. It uses a publish/subscribe approach in various places.
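The publish/subscribe idea can be sketched in a few lines of plain Python. This is a generic illustration of the pattern, not Apache Spark's implementation: the `ListenerBus` class, its `subscribe` and `post` methods, and the event names are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Listener = Callable[[dict], None]


class ListenerBus:
    """Hypothetical in-process bus: listeners subscribe to an event type by name."""

    def __init__(self) -> None:
        self._listeners: DefaultDict[str, List[Listener]] = defaultdict(list)

    def subscribe(self, event_type: str, listener: Listener) -> None:
        self._listeners[event_type].append(listener)

    def post(self, event_type: str, payload: dict) -> int:
        """Deliver the payload to every listener of event_type; return how many were notified."""
        listeners = self._listeners.get(event_type, [])
        for listener in listeners:
            listener(payload)
        return len(listeners)


bus = ListenerBus()
finished_jobs: List[int] = []

# The subscriber only records the identifiers of finished jobs.
bus.subscribe("job_end", lambda event: finished_jobs.append(event["job_id"]))

notified = bus.post("job_end", {"job_id": 7})
ignored = bus.post("stage_end", {"stage_id": 1})  # no subscriber for this type
```

The publisher never references its subscribers directly, which is what allows new listeners to be added without touching the code that emits the events.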

Architecture

Architecture Designing IT

Learn How OneWeb Delivers Space-Based Connectivity with Snowflake

Snowflake

FEBRUARY 3, 2023

OneWeb and its constellation of 648 satellites help connect the otherwise unreachable. Learn how it uses data mesh—and Snowflake—to help manage its data and unlock untapped potential. OneWeb isn’t your typical communications company. Its constellation of 648 low Earth orbit (LEO) satellites provides high-speed, low-latency connectivity for governments, businesses, and communities almost anywhere on the planet.

BI Architecture Government Cloud

Generated method too long to be JIT compiled

Waitingforcode

FEBRUARY 3, 2023

There are days like that. You inherit some code and it doesn't really work as expected. While digging into the issues you find the usual weird warnings but also several new things. For me, one of these things was the "Generated method too long to be JIT compiled." info message.

Coding

Coding IT

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

Speaker: Aarushi Kansal, AI Leader & Author and Tony Karrer, Founder & CTO at Aggregage

Software leaders who are building applications based on Large Language Models (LLMs) often find it a challenge to achieve reliability. It’s no surprise given the non-deterministic nature of LLMs. To effectively create reliable LLM-based (often with RAG) applications, extensive testing and evaluation processes are crucial. This often ends up involving meticulous adjustments to prompts.

Building

Serializers in PySpark

Waitingforcode

FEBRUARY 3, 2023

We've learned in the previous PySpark blog posts about the serialization overhead between the Python application and the JVM. An intrinsic part of this overhead is the Python serializers. They are the topic of this article, which will hopefully provide a more complete overview of Python-JVM serialization.

Python

Azure Synapse Link as Hybrid Transactional/Analytical Processing

Waitingforcode

FEBRUARY 3, 2023

I've discovered the term from the title while learning Azure Synapse and Cosmos DB services. I had heard of NoSQL, or even NewSQL, but never of a solution supporting analytical and transactional workloads at once.

NoSQL

NoSQL Process

Shuffle in PySpark

Waitingforcode

FEBRUARY 3, 2023

Shuffle is for me a never-ending story. Last year I spent long weeks analyzing the readers and writers and was hoping for some rest in 2022. However, it didn't happen. My recent PySpark investigation led me to the shuffle.py file and my first reaction was "Oh, so PySpark has its own shuffle mechanism?". Let's check this out!

Apache Airflow 2 overview - part 1

Waitingforcode

FEBRUARY 3, 2023

Apache Airflow 2 introduced a lot of new features. The most visible one is probably a reworked UI but there is more! In this and the next blog post I'll show some of the interesting new Apache Airflow features.

The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Communication

Speaker: David Bard, Principal at VP Product Coaching

In the fast-paced world of digital innovation, success is often accompanied by a multitude of challenges - like the pitfalls lurking at every turn, threatening to derail the most promising projects. But fret not, this webinar is your key to effective product development! Join us for an enlightening session to empower you to lead your team to greater heights.

Certification

Useful classes for data engineers - Scala & Java

Waitingforcode

FEBRUARY 3, 2023

We all have our habits and, for programmers, libraries and frameworks are definitely among them. In this blog post I'll share with you a list of Java and Scala classes I use almost every time in data engineering projects. The part for Python will follow next week!

Scala

Scala Java Data Engineering Data Engineer

Worth reading for data engineers - part 1

Waitingforcode

FEBRUARY 3, 2023

Hi and welcome to the new series. This time I won't blog about my discoveries. Instead, I'm going to look at other blog posts from the data engineering space and share some key takeaways with you. I don't know yet how regular it will be, but hopefully I'll be able to share some notes every month.

Data Engineering

Data Engineering Data Engineer Engineering Data

Apache Spark as you don't know it

Waitingforcode

FEBRUARY 3, 2023

It's difficult to see all the use cases of a framework. Back when I was a backend engineer, I never managed to see all the applications of the Spring framework. Now, as a data engineer, I feel the same about Apache Spark. Fortunately, the community is there to show me some outstanding features!

IT Data Engineering Data Engineer Engineering

Apache Airflow 2 overview - part 2

Waitingforcode

FEBRUARY 3, 2023

Welcome to the 2nd blog post dedicated to Apache Airflow 2 features. This time it'll be more about custom code you can add to the most recent version.

Coding

How to Build an Experimentation Culture for Data-Driven Product Development

Speaker: Margaret-Ann Seger, Head of Product, Statsig

Experimentation is often seen as an aspirational practice, especially at smaller, fast-moving companies that are strapped for time and resources. So, how can you get your team making decisions in a more data-driven way while remaining lean and maintaining ship velocity? In this webinar, Margaret-Ann Seger, Head of Product at Statsig, will teach you how to build an experimentation culture from the ground up, graduating from just getting started with data-driven development to operating

Building
