Top Data Engineering Digest Data Analysis Tools High Quality Data Content for December, 2022

December, 2022

Data Pipeline Design Patterns - #1. Data flow patterns

Start Data Engineering

DECEMBER 11, 2022

1. Introduction 2. Source & Sink 2.1. Source Replayability 2.2. Source Ordering 2.3. Sink Overwritability 3. Data pipeline patterns 3.1. Extraction patterns 3.1.1. Time ranged 3.1.2. Full Snapshot 3.1.3. Lookback 3.1.4. Streaming 3.2. Behavioral 3.2.1. Idempotent 3.2.2. Self-healing 3.3. Structural 3.3.1. Multi-hop pipelines 3.3.2. Conditional/ Dynamic pipelines 3.3.3.

Data Pipeline

Data Pipeline Designing Data

Dataframe Showdown – Polars vs Spark vs Pandas vs DataFusion. Guess who wins?

Confessions of a Data Guy

DECEMBER 10, 2022

There once was a day when no one used DataFrames that much. Back before Spark had really gone mainstream, Data Scientists were still plinking around with Pandas a lot. My My, what would your mother say? How things have changed. Now everyone wants a piece of the DataFrame pie. I mean it tastes so good, […]

Data

Data IT Big Data Data Engineering

Webinars

The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Communication

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Trending Sources

A Return to the Office (RTO) Wave?

The Pragmatic Engineer

DECEMBER 8, 2022

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of five topics in today’s subscriber-only The Scoop issue. To get this newsletter every week, subscribe here. On Thursday, 29 November, Snap CEO Evan Spiegel sent an email announcing Snap will mandate 4 days/week in the office, starting from January.

Software Engineer

Software Engineer Software Engineering Medical Insurance

Data News — must-read 2022 articles

Christophe Blefari

DECEMBER 30, 2022

Hey you, this is the last article of the year and it's gonna be about the articles and trends that made 2022 according to me. You'll see articles that I've already shared during the year. 💡 You can also read the 2021 must-read that I did a year and a half ago or how to learn data engineering that contains key articles to understand the field.

Data Warehouse

Data Warehouse BI Data Machine Learning

Navigating the Future: Generative AI, Application Analytics, and Data

Generative AI is upending the way product developers & end-users alike are interacting with data. Despite the potential of AI, many are left with questions about the future of product development: How will AI impact my business and contribute to its success? What can product managers and developers expect in the future with the widespread adoption of AI?

Data

Should We Get Rid Of ETLs?

Seattle Data Guy

DECEMBER 29, 2022

AWS has jumped on the bandwagon of removing the need for ETLs. Snowflake announced this both with their hybrid tables and their partnership with Salesforce. Now, I do take a little issue with the naming “Zero ETLs”. Because at the very surface the functionality described is often closer to a zero integration future, which probably…

AWS

AWS Consulting Big Data Data

Using Product Driven Development To Improve The Productivity And Effectiveness Of Your Data Teams

Data Engineering Podcast

DECEMBER 28, 2022

Summary With all of the messaging about treating data as a product it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long term i

Data Lake

Data Lake Data Warehouse Data Pipeline MongoDB

Brief History of Data Engineering

Jesse Anderson

DECEMBER 12, 2022

In the beginning, there was Google. Google looked over the expanse of the growing internet and realized they’d need scalable systems. They created MapReduce and GFS in 2004. They published the papers for them in the same year. Doug Cutting took those papers and created Apache Hadoop in 2005. Cloudera was started in 2008, and Hortonworks started in 2011.

Data Engineering

Data Engineering Data Engineer Engineering Hadoop

More Trending

I asked ChatGPT to write a blog post about Data Engineering. Here it is.

Confessions of a Data Guy

DECEMBER 29, 2022

Data engineering is a vital field within the realm of data science that focuses on the practical aspects of collecting, storing, and processing large amounts of data. It involves designing and building the infrastructure to store and process data, as well as developing the tools and systems to extract valuable insights and knowledge from that […]

Data Engineering

Data Engineering Data Engineer Engineering IT

What Can AI-Powered RPA and IA Mean For Businesses?

KDnuggets

DECEMBER 19, 2022

RPA and IA have stunned the business world by offering impressive intelligent automation capabilities to businesses of all sizes across industries, as this blog explains.

How to manage and schedule dbt

Christophe Blefari

DECEMBER 19, 2022

Last week dbt Labs decided to change the pricing of their Cloud offering. I've already analysed this in week #22.50 of the Data News. In a nutshell, dbt Cloud pricing is per-seat, which means you pay for each dbt developer. Previously for a team it was $50/month/dev and it increased to $100/month/dev, a 100% increase with a team limit of 8 devs and only one project.

Management

Management Pipeline-centric Database-centric SQL

Data warehouses vs Data Lakes vs Databases – Which One Do You Need

Seattle Data Guy

DECEMBER 19, 2022

By Reseun McClendon. Today, your enterprise must effectively collect, store, and integrate data from disparate sources to provide both operational and analytical benefits. Whether it's helping increase revenue by finding new customers or reducing costs, all of it starts with data. Data analysts, data scientists, engineers, and managers all require a robust data storage solution for…

Data Lake

Data Lake Data Warehouse Database Data Storage

Get Better Network Graphs & Save Analysts Time

Many organizations today are unlocking the power of their data by using graph databases to feed downstream analytics, enhance visualizations, and more. Yet, when different graph nodes represent the same entity, graphs get messy. Watch this essential video with Senzing CEO Jeff Jonas on how adding entity resolution to a graph database condenses network graphs to improve analytics and save your analysts time.

Database

Increase Your Odds Of Success For Analytics And AI Through More Effective Knowledge Management With AlignAI

Data Engineering Podcast

DECEMBER 29, 2022

Summary Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increases, the difficulty of ensuring that everyone has the necessary knowledge about how to get their work done scales exponentially. Wikis and intranets are a common way to attempt to solve this problem, but they are frequently ineffective.

Management

Management Metadata Business Intelligence Data Lake

Building a Telegram Bot Powered by Apache Kafka and ksqlDB

Confluent

DECEMBER 2, 2022

ksqlDB use case: see how apps can use ksqlDB to ingest, filter, enrich, aggregate, and query data directly with Kafka—no complex architectures or data stores needed.

Kafka

Kafka Building Architecture Data

What is Apache Arrow? Asking for a friend.

Confessions of a Data Guy

DECEMBER 27, 2022

We’ve all been in that spot, especially in tech. You wanted to fit in, be cool, and look smart, so you didn’t ask any questions. And now it’s too late. You’re stuck. Now you simply can’t ask … you’re too afraid. I get it. Apache Arrow is probably one of those things. It keeps popping […]

IT Data Big Data Data Engineering

Data Science Minimum: 10 Essential Skills You Need to Know to Start Doing Data Science

KDnuggets

DECEMBER 30, 2022

Data science is ever-evolving, so mastering its foundational technical and soft skills will help you be successful in a career as a Data Scientist, as well as pursue advanced concepts, such as deep learning and artificial intelligence.

Data Science

Data Science Deep Learning Data IT

Understanding User Needs and Satisfying Them

Speaker: Scott Sehlhorst

We know we want to create products which our customers find to be valuable. Whether we label it as customer-centric or product-led depends on how long we've been doing product management. There are three challenges we face when doing this. The obvious challenge is figuring out what our users need; the non-obvious challenges are in creating a shared understanding of those needs and in sensing if what we're doing is meeting those needs.

Certification

Data News — Week 22.50

Christophe Blefari

DECEMBER 16, 2022

Hey you, the end of the year is coming soon. I really liked this year with you. It was super fun to write my opinion on data topics every Friday of the year. I don't know yet if next year I'll be able to pull out stuff without repeating myself (I hate repeating myself), but for sure I'll try and I'll continue.

Kafka

Kafka Data SQL Cloud

Reducing Data Analytics Costs In 2023 – Doing More With Less

Seattle Data Guy

DECEMBER 10, 2022

If you haven’t started looking for ways to improve your data analytics budget for 2023, then you’re probably already behind. The truth is that between all of the various economic indicators and investor letters, everyone is looking to audit all parts of their business, especially where there has likely been bloat. One of those…

Data Analytics

Data Analytics Data Consulting Big Data

Simple And Scalable Encryption Of Data In Use For Analytics And Machine Learning With Opaque Systems

Data Engineering Podcast

DECEMBER 25, 2022

Summary Encryption and security are critical elements in data analytics and machine learning applications. We have well developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming and the capabilities that could be unlocked by a robust solution, Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies.

Machine Learning

Machine Learning Systems Data Lake Data Warehouse

Broadcom Modernizes Machine Learning and Anomaly Detection with ksqlDB

Confluent

DECEMBER 2, 2022

Broadcom's Mainframe Operational Intelligence Product (MOI) collects and analyzes data at mass scale, using ksqlDB to improve anomaly detection and custom alarm filtering.

Machine Learning

Machine Learning Data

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Speaker: Timothy Chan, PhD., Head of Data Science

Are you ready to move beyond the basics and take a deep dive into the cutting-edge techniques that are reshaping the landscape of experimentation? 🌐 From Sequential Testing to Multi-Armed Bandits, Switchback Experiments to Stratified Sampling, Timothy Chan, Data Science Lead, is here to unravel the mysteries of these powerful methodologies that are revolutionizing how we approach testing.

Data Science

Why Data Migrations Suck.

Confessions of a Data Guy

DECEMBER 5, 2022

I’ve often wondered what purgatory would be like, doing penance for millennia into eternity. It would probably be doing data migrations. I suppose they are not all that dissimilar from normal software migrations, but there are a few things that make data migrations a little more horrible and soul-sucking. Data migrations are able to slow […]

Data

Data IT Big Data Data Engineering

More Data Science Cheatsheets

KDnuggets

DECEMBER 30, 2022

It's time again to look at some data science cheatsheets. Here you can find a short selection of such resources which can cater to different existing levels of knowledge and breadth of topics of interest.

Data Science

Data Science Data IT

Data News — Week 22.49

Christophe Blefari

DECEMBER 9, 2022

Hello there, this is Christophe, live from the human world. Last week was totally driven by the ChatGPT frenzy; the social networks I follow are spammed with conversation screenshots and hype. On my side I don't know what the future holds for us, but for sure MaaS (Models as a Service) does not look bright to me.

SQL

SQL Data Data Science Metadata

Best of 2022: 5 Most Popular Cybersecurity Blogs Of The Year

U-Next

DECEMBER 21, 2022

Introduction. Are you a Cybersecurity enthusiast looking to know the latest trends and goings-on in the cybersecurity industry? Or are you just a tech enthusiast who likes to stay updated on what is happening around you? Then you are in the perfect place. As another year comes to an end, we decided the best way to look back was to revisit the most popular and sought-after Cybersecurity blogs and list them for all our Cybersecurity enthusiasts.

Education

Education Government Project Engineering

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

Speaker: Aarushi Kansal, AI Leader & Author and Tony Karrer, Founder & CTO at Aggregage

Software leaders who are building applications based on Large Language Models (LLMs) often find it a challenge to achieve reliability. It’s no surprise given the non-deterministic nature of LLMs. To effectively create reliable LLM-based (often with RAG) applications, extensive testing and evaluation processes are crucial. This often ends up involving meticulous adjustments to prompts.

Building

Making Sense Of The Technical And Organizational Considerations Of Data Contracts

Data Engineering Podcast

DECEMBER 18, 2022

Summary One of the reasons that data work is so challenging is because no single person or team owns the entire process. This introduces friction in the process of collecting, processing, and using data. In order to reduce the potential for broken pipelines some teams have started to adopt the idea of data contracts. In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in implementing

Metadata

Metadata Business Intelligence Data Lake BI

From Eager to Smarter in Apache Kafka Consumer Rebalances

Confluent

DECEMBER 2, 2022

Major improvements to the Kafka consumer, Streams, and ksqlDB for incremental cooperative rebalancing while maintaining at-least-once and exactly-once guarantees.

Kafka

Kafka Process

Safety First: Using vehicle data to make us all better drivers

Teradata

DECEMBER 15, 2022

Vehicle data is invaluable in improving the safety & safe operation of vehicles for their occupants & other drivers. The next gen of vehicles will use real-time analysis to make driving even safer.

Data

How To Overcome The Fear of Math and Learn Math For Data Science

KDnuggets

DECEMBER 16, 2022

Many aspiring Data Scientists, especially when self-learning, fail to learn the necessary math foundations. These recommendations for learning approaches along with references to valuable resources can help you overcome a personal sense of not being "the math type" or belief that you "always failed in math."

Data Science

Data Science Data

How Embedded Analytics Gets You to Market Faster with a SAAS Offering

Start-ups & SMBs launching products quickly must bundle dashboards, reports, & self-service analytics into apps. Customers expect rapid value from your product (time-to-value), data security, and access to advanced capabilities. Traditional Business Intelligence (BI) tools can provide valuable data analysis capabilities, but they have a barrier to entry that can stop small and midsize businesses from capitalizing on them.

Data Analysis

Introducing Cloudera DataFlow Designer: Self-service, No-Code Dataflow Design

Cloudera

DECEMBER 9, 2022

Cloudera has been providing enterprise support for Apache NiFi since 2015, helping hundreds of organizations take control of their data movement pipelines on premises and in the public cloud. Working with these organizations has taught us a lot about the needs of developers and administrators when it comes to developing new dataflows and supporting them in mission-critical production environments.

Designing

Designing Coding Google Cloud AWS

Making GHC faster at emitting code

Tweag

DECEMBER 21, 2022

One common complaint from industrial users of Haskell is that of compilation times: they are sometimes painfully slow. Some of that slowness is difficult to avoid—no matter how you slice it, typechecking and optimizing Haskell code takes a lot of work—but nobody would argue that there is not ample room for improvement. For the past few months, Krzysztof Gogolewski and I have had the opportunity to work with Mercury to identify what some of those improvements might be, and I am pleased to report

Coding

Coding Architecture Programming Designing

Data Catalog - A Broken Promise

Data Engineering Weekly

DECEMBER 29, 2022

Data catalogs are the most expensive data integration systems you never intended to build. Data Catalog as a passive web portal to display metadata requires significant rethinking to adopt modern data workflow, not just adding “modern” in its prefix. I know that is an expensive statement to make 😊 To be fair, I’m a big fan of data catalogs, or metadata management, to be precise.

Metadata

Metadata Data Warehouse ETL Tools Data Workflow

ksqlDB Execution Plans: Move Fast But Don’t Break Things

Confluent

DECEMBER 2, 2022

Build fast, break nothing. Learn about the unique challenges Confluent's engineering team has faced building ksqlDB and continuously shipping the latest, greatest features.

Building

Building Engineering Process

Embedding BI: Architectural Considerations and Technical Requirements

While data platforms, artificial intelligence (AI), machine learning (ML), and programming platforms have evolved to leverage big data and streaming data, the front-end user experience has not kept up. Holding onto old BI technology while everything else moves forward is holding back organizations. Traditional Business Intelligence (BI) tools aren’t built for modern data platforms and don’t work on modern architectures.
