December, 2022

Data Pipeline Design Patterns - #1. Data flow patterns

Start Data Engineering

1. Introduction 2. Source & Sink 2.1. Source Replayability 2.2. Source Ordering 2.3. Sink Overwritability 3. Data pipeline patterns 3.1. Extraction patterns 3.1.1. Time ranged 3.1.2. Full Snapshot 3.1.3. Lookback 3.1.4. Streaming 3.2. Behavioral 3.2.1. Idempotent 3.2.2. Self-healing 3.3. Structural 3.3.1. Multi-hop pipelines 3.3.2. Conditional/ Dynamic pipelines 3.3.3.

A Return to the Office (RTO) Wave?

The Pragmatic Engineer

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of five topics in today’s subscriber-only The Scoop issue. To get this newsletter every week, subscribe here. On Thursday, 29 November, Snap CEO Evan Spiegel sent an email announcing Snap will mandate 4 days/week in the office, starting from January.

Data News — must-read 2022 articles

Christophe Blefari

kitsch moment, from me to you ( credits ) Hey you, this is the last article of the year and it's gonna be about the articles and trends that made 2022 according to me. You'll see articles that I've already shared during the year. 💡 You can also read the 2021 must-reads that I did a year and a half ago or how to learn data engineering, which contains key articles to understand the field.

Data 130

I asked ChatGPT to write a blog post about Data Engineering. Here it is.

Confessions of a Data Guy

Data engineering is a vital field within the realm of data science that focuses on the practical aspects of collecting, storing, and processing large amounts of data. It involves designing and building the infrastructure to store and process data, as well as developing the tools and systems to extract valuable insights and knowledge from that […]

Should We Get Rid Of ETLs?

Seattle Data Guy

AWS has jumped on the bandwagon of removing the need for ETLs. Snowflake announced this both with their hybrid tables and their partnership with Salesforce. Now, I do take a little issue with the naming “Zero ETLs”. Because at the very surface the functionality described is often closer to a zero integration future, which probably…

AWS 130

Using Product Driven Development To Improve The Productivity And Effectiveness Of Your Data Teams

Data Engineering Podcast

Summary With all of the messaging about treating data as a product, it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst, which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long term i

Data 130

More Trending

What Can AI-Powered RPA and IA Mean For Businesses?

KDnuggets

RPA and IA have stunned the business world by offering impressive intelligent automation capabilities to businesses of all sizes across industries, as this blog explains.

How to manage and schedule dbt

Christophe Blefari

Last week dbt Labs decided to change the pricing of their Cloud offering. I've already analysed this in week #22.50 of the Data News. In a nutshell, dbt Cloud pricing is per seat, which means you pay for each dbt developer. Previously for a team it was $50/month/dev and they increased it to $100/month/dev, a 100% increase with a team limit of 8 devs and only one project.

What is Apache Arrow? Asking for a friend.

Confessions of a Data Guy

We’ve all been in that spot, especially in tech. You wanted to fit in, be cool, and look smart, so you didn’t ask any questions. And now it’s too late. You’re stuck. Now you simply can’t ask … you’re too afraid. I get it. Apache Arrow is probably one of those things. It keeps popping […]

Data 130

Data warehouses vs Data Lakes vs Databases – Which One Do You Need

Seattle Data Guy

By Reseun McClendon Today, your enterprise must effectively collect, store, and integrate data from disparate sources to provide both operational and analytical benefits. Whether it's helping increase revenue by finding new customers or reducing costs, all of it starts with data. Data analysts, data scientists, engineers, and managers all require a robust data storage solution for…

Data Lake 130

Increase Your Odds Of Success For Analytics And AI Through More Effective Knowledge Management With AlignAI

Data Engineering Podcast

Summary Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increases, the difficulty of ensuring that everyone has the necessary knowledge about how to get their work done scales exponentially. Wikis and intranets are a common way to attempt to solve this problem, but they are frequently ineffective.

Metadata 130

Ready-to-go sample data pipelines with Dataflow

Netflix Tech

by Jasmine Omeke, Obi-Ike Nwoke, Olek Gorajek Intro This post is for all data practitioners who are interested in learning about bootstrapping, standardization and automation of batch data pipelines at Netflix. You may remember Dataflow from the post we wrote last year titled Data pipeline asset management with Dataflow. That article was a deep dive into one of the more technical aspects of Dataflow and didn’t properly introduce this tool in the first place.

Data Science Minimum: 10 Essential Skills You Need to Know to Start Doing Data Science

KDnuggets

Data science is ever-evolving, so mastering its foundational technical and soft skills will help you be successful in a career as a Data Scientist, as well as pursue advanced concepts, such as deep learning and artificial intelligence.

Data News — Week 22.50

Christophe Blefari

Prepping me to deliver Christmas' Data News ( credits ) Hey you, the end of the year is coming soon. I really liked this year with you. It was super fun to write every Friday of the year my opinion on data topics, I don't know yet if next year I'll be able to pull out stuff without repeating myself, I hate repeating myself, but for sure I'll try and I'll continue.

Data 130

Dataframe Showdown – Polars vs Spark vs Pandas vs DataFusion. Guess who wins?

Confessions of a Data Guy

There once was a day when no one used DataFrames that much. Back before Spark had really gone mainstream, Data Scientists were still plinking around with Pandas a lot. My My, what would your mother say? How things have changed. Now everyone wants a piece of the DataFrame pie. I mean it tastes so good, […]

Data 130

Reducing Data Analytics Costs In 2023 – Doing More With Less

Seattle Data Guy

If you haven’t started looking for ways to improve your data analytics budget for 2023, then you’re probably already behind. The truth is that between all of the various economic indicators and investor letters, everyone is looking to audit all parts of their business, especially where there has likely been bloat. One of those…

Simple And Scalable Encryption Of Data In Use For Analytics And Machine Learning With Opaque Systems

Data Engineering Podcast

Summary Encryption and security are critical elements in data analytics and machine learning applications. We have well-developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming and the capabilities that could be unlocked by a robust solution, Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies.

Building a Telegram Bot Powered by Apache Kafka and ksqlDB

Confluent

ksqlDB use case: see how apps can use ksqlDB to ingest, filter, enrich, aggregate, and query data directly with Kafka—no complex architectures or data stores needed.

Kafka 144

More Data Science Cheatsheets

KDnuggets

It's time again to look at some data science cheatsheets. Here you can find a short selection of such resources which can cater to different existing levels of knowledge and breadth of topics of interest.

Data News — Week 22.49

Christophe Blefari

This is what we call a Chat in French ( credits ) Hello there, this is Christophe, live from the human world. Last week was totally driven by the ChatGPT frenzy; the social networks I follow are spammed with conversation screenshots and hype. On my side I don't know what the future holds for us, but for sure MaaS (Models as a Service) does not look bright to me.

Data 130

Safety First: Using vehicle data to make us all better drivers

Teradata

Vehicle data is invaluable in improving the safety & safe operation of vehicles for their occupants & other drivers. The next gen of vehicles will use real-time analysis to make driving even safer.

Data 105

Introducing Cloudera DataFlow Designer: Self-service, No-Code Dataflow Design

Cloudera

Cloudera has been providing enterprise support for Apache NiFi since 2015, helping hundreds of organizations take control of their data movement pipelines on premises and in the public cloud. Working with these organizations has taught us a lot about the needs of developers and administrators when it comes to developing new dataflows and supporting them in mission-critical production environments.

Designing 101

Making Sense Of The Technical And Organizational Considerations Of Data Contracts

Data Engineering Podcast

Summary One of the reasons that data work is so challenging is because no single person or team owns the entire process. This introduces friction in the process of collecting, processing, and using data. In order to reduce the potential for broken pipelines some teams have started to adopt the idea of data contracts. In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in implementing

Data 130

Broadcom Modernizes Machine Learning and Anomaly Detection with ksqlDB

Confluent

Broadcom's Mainframe Operational Intelligence Product (MOI) collects and analyzes data at mass scale, using ksqlDB to improve anomaly detection and custom alarm filtering.

How To Overcome The Fear of Math and Learn Math For Data Science

KDnuggets

Many aspiring Data Scientists, especially when self-learning, fail to learn the necessary math foundations. These recommendations for learning approaches along with references to valuable resources can help you overcome a personal sense of not being "the math type" or belief that you "always failed in math."

Best of 2022: 5 Most Popular Cybersecurity Blogs Of The Year

Analytics Training

Introduction. Are you a Cybersecurity enthusiast looking to know the latest trends and goings-on in the cybersecurity industry? Or are you just a tech enthusiast who likes to stay updated on what is happening around you? Then you are in the perfect place. As another year comes to an end, we decided the best way to look back was to revisit the most popular and sought-after Cybersecurity blogs and list them for all our Cybersecurity enthusiasts.

Data 90

Teradata VantageCloud Lake + Vcinity Technology: Taking the Management Costs out of Network Latency

Teradata

Teradata VantageCloud Lake + Vcinity is perfect for on-premises, hybrid, & multi-cloud solutions where long network latency might keep an enterprise from leveraging access to their sensitive data.

Go Hybrid & Multi-Cloud or Don’t Go

Cloudera

Over the past few months industry analysts have been making some pretty controversial recommendations for data management in the cloud. For a thoughtful and entertaining analysis, I strongly recommend you spend a few minutes watching the keynote session by Pat Moorhead, CEO Moor Insights & Strategy, at the Evolve 2022 Data event in New York. His takeaway: “The world is very much going to be hybrid and multi-cloud.

Cloud 96

Data Catalog - A Broken Promise

Data Engineering Weekly

Data catalogs are the most expensive data integration systems you never intended to build. Data Catalog as a passive web portal to display metadata requires significant rethinking to adopt modern data workflow, not just adding “modern” in its prefix. I know that is an expensive statement to make 😊 To be fair, I’m a big fan of data catalogs, or metadata management, to be precise.

From Eager to Smarter in Apache Kafka Consumer Rebalances

Confluent

Major improvements to the Kafka consumer, Streams, and ksqlDB for incremental cooperative rebalancing while maintaining at-least-once and exactly-once guarantees.

Kafka 138

We Don’t Need Data Scientists, We Need Data Engineers

KDnuggets

As more people are entering the field of Data Science and more companies are hiring for data-centric roles, what type of jobs are currently in highest demand? There is so much data in the world, and it just keeps flooding in, it now looks like companies are targeting those who can engineer that data more than those who can only model the data.

Why Picnic picked Java

Picnic Engineering

Picking a tech stack for your startup isn’t something to do lightly. It’s a choice that will shape the future in many ways: how will the tech enable your emerging product and business, what talent can you attract, and how future-proof is the tech stack? When Picnic launched as the first app-only supermarket back in 2015 in The Netherlands, the tech landscape looked markedly different from today.

Java 59

Five Challenges to Building an Isomorphic JavaScript Library

DoorDash Engineering

Building software today can require working on the server side and client side, but building isomorphic JavaScript libraries can be a challenge if you are unaware of some particular issues, which can involve picking the right dependencies and selectively importing them, among others. For context, Isomorphic JavaScript, also known as Universal JavaScript, is JavaScript code that can run in any environment, including Node.js or a web browser.

Cloudera Named a Leader in the 2022 Gartner® Magic Quadrant™ for Cloud Database Management Systems (DBMS)

Cloudera

We are pleased to announce that Cloudera has been named a Leader in the 2022 Gartner® Magic Quadrant for Cloud Database Management Systems. Cloudera has been recognized in this cloud DBMS report since its inception in 2020. This year we’ve been named a Leader. This validates our significant momentum in global enterprises. And together, with our recent recognition in the Gartner Peer Insights Customer Choice Distinction for Cloud DBMS, cements our position as an industry leader.

Cloud 93