Sat. Dec 10, 2022 - Fri. Dec 16, 2022

Data Pipeline Design Patterns - #1. Data flow patterns

Start Data Engineering

1. Introduction
2. Source & Sink
   2.1. Source Replayability
   2.2. Source Ordering
   2.3. Sink Overwritability
3. Data pipeline patterns
   3.1. Extraction patterns
      3.1.1. Time ranged
      3.1.2. Full Snapshot
      3.1.3. Lookback
      3.1.4. Streaming
   3.2. Behavioral
      3.2.1. Idempotent
      3.2.2. Self-healing
   3.3. Structural
      3.3.1. Multi-hop pipelines
      3.3.2. Conditional/Dynamic pipelines
      3.3.3.

Dataframe Showdown – Polars vs Spark vs Pandas vs DataFusion. Guess who wins?

Confessions of a Data Guy

There once was a day when no one used DataFrames that much. Back before Spark had really gone mainstream, Data Scientists were still plinking around with Pandas a lot. My, my, what would your mother say? How things have changed. Now everyone wants a piece of the DataFrame pie. I mean it tastes so good, […]

Data 147

Data News — Week 22.50

Christophe Blefari

Prepping me to deliver Christmas' Data News (credits). Hey you, the end of the year is coming soon. I really liked this year with you. It was super fun to write my opinion on data topics every Friday of the year. I don't know yet if next year I'll be able to pull out stuff without repeating myself (I hate repeating myself), but for sure I'll try and I'll continue.

Kafka 130

Brief History of Data Engineering

Jesse Anderson

In the beginning, there was Google. Google looked over the expanse of the growing internet and realized they’d need scalable systems. They created MapReduce and GFS in 2004. They published the papers for them in the same year. Doug Cutting took those papers and created Apache Hadoop in 2005. Cloudera was started in 2008, and HortonWorks started in 2011.

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Speaker: Anne Steiner and David Laribee

As a concept, Developer Experience (DX) has gained significant attention in the tech industry. It emphasizes engineers’ efficiency and satisfaction during the product development process. As product managers, we need to understand how a good DX can contribute not only to the well-being of our development teams but also to the broader objectives of product success and customer satisfaction.

Reducing Data Analytics Costs In 2023 – Doing More With Less

Seattle Data Guy

If you haven’t started looking for ways to improve your data analytics budget for 2023, then you’re probably already behind. The truth is that between all of the various economic indicators and investor letters, everyone is looking to audit all parts of their business, especially where there has likely been bloat. One of those…

How To Overcome The Fear of Math and Learn Math For Data Science

KDnuggets

Many aspiring Data Scientists, especially when self-learning, fail to learn the necessary math foundations. These recommendations for learning approaches, along with references to valuable resources, can help you overcome a personal sense of not being "the math type" or the belief that you "always failed in math."

More Trending

Safety First: Using vehicle data to make us all better drivers

Teradata

Vehicle data is invaluable in improving the safety & safe operation of vehicles for their occupants & other drivers. The next gen of vehicles will use real-time analysis to make driving even safer.

Data 105

Put Your Data to Work: Top 5 Data Technology Trends for 2023

Confluent

As businesses move to meet modern demands, these technologies ensure not only a digital transformation, but data transformation, with new use cases surrounding real-time data.

5 Python Projects for Data Science Portfolio

KDnuggets

Get more experience by working on web scraping, data analytics, time-series forecasting, machine learning, and deep learning projects.

Portfolio 140

Convert Your Unstructured Data To Embedding Vectors For More Efficient Machine Learning With Towhee

Data Engineering Podcast

Preamble: This is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning. Summary: Data is one of the core ingredients for machine learning, but the format in which it is understandable to humans is not a useful representation for models. Embedding vectors are a way to structure data in a way that is native to how models interpret and manipulate information.

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

Speaker: Aarushi Kansal, AI Leader & Author and Tony Karrer, Founder & CTO at Aggregage

Software leaders who are building applications based on Large Language Models (LLMs) often find it a challenge to achieve reliability. It’s no surprise given the non-deterministic nature of LLMs. To effectively create reliable LLM-based (often with RAG) applications, extensive testing and evaluation processes are crucial. This often ends up involving meticulous adjustments to prompts.

Career stories: Next-gen systems, servers, and SREs

LinkedIn Engineering

Saira joined our Bangalore site reliability engineering (SRE) team to tackle large-scale site engineering challenges and grow. She highlights for us the impactful work she found here, from ushering in LinkedIn's next-generation server query system that runs over a fleet of 350,000 servers to mentoring the next generation of female engineers: In my engineering career, I've always followed the path less taken.

Systems 55

Using rideshare data to evaluate racial bias in the issuance of speeding citations

Lyft Engineering

The disproportionate impact of policing on communities of color¹ is a central social and policy concern in the United States, and a topic of intense study in academia. Lyft is uniquely positioned to contribute to this discourse and the academic research and literature on this topic using data from the large number of trips on our rideshare platform.

Top 5 NLP Cheat Sheets for Beginners to Professional

KDnuggets

The cheat sheets cover various NLP techniques, tasks, algorithms, frameworks, and analytics.

Algorithm 160

Beyond the Hype: Blockchain is dead, long live blockchain by Colin Eberhardt

Scott Logic

In this episode, I’m joined by colleagues Oliver Cronk, Peter Chamberlin and Chris Price for a lively discussion about blockchain. We start by looking at the mechanics of bitcoin and the economic incentive model formed by proof-of-work consensus. From there, we discuss enterprise or permissioned blockchains, which leads us to some specific use cases, for example the oil market's supply-chain challenges.

How to Build an Experimentation Culture for Data-Driven Product Development

Speaker: Margaret-Ann Seger, Head of Product, Statsig

Experimentation is often seen as an aspirational practice, especially at smaller, fast-moving companies who are strapped for time and resources. So, how can you get your team making decisions in a more data-driven way while continuing to remain lean and maintaining ship velocity? In this webinar, Margaret-Ann Seger, Head of Product at Statsig, will teach you how to build an experimentation culture from the ground-up, graduating from just getting started with data-driven development to operating

Emerging Technologies: What Did Everyone Want To Know In 2022?

U-Next

Exploring the unknown and achieving new milestones every other day seems to be the norm of the 21st century. Even at the peak of technological innovation, the human hunger and determination to discover and create things never heard of before does not seem to be even mildly deterred, even by a global pandemic. In fact, the pandemic only made us realize how much we do not know about the world we live in and how much more there is to know and discover.

Using the Amazon MSK Native Connector to Simplify Real-Time Analytics on Kafka

Rockset

Rockset’s native connector for Amazon Managed Streaming for Apache Kafka (MSK) makes it simpler and faster to ingest streaming data for real-time analytics. Amazon MSK is a fully managed AWS service that gives users the ability to build and run applications using Apache Kafka. Amazon MSK provides control-plane operations such as creating and deleting clusters, while allowing users to use Apache Kafka data-plane operations for producing and consuming data.

Kafka 52

Markdown Cheatsheet

KDnuggets

Markdown is a lightweight markup language for creating formatted text using a plain-text editor. Grab this handy reference sheet to make certain you know how to implement what you need to, when you want to!

Java vs. C++: Which Language Should You Choose for Your 2023 Project?

Trio

Both Java and C++ are equally renowned when it comes to building modern, industry-leading applications and platforms. Both have existed for decades now, share many similarities in syntax, and support object-oriented programming (OOP). In fact, Java's syntax was largely modeled on C and C++, and the language was intended to serve a broader audience than C++.

Java 52

Entity Resolution Checklist: What to Consider When Evaluating Options

Are you trying to decide which entity resolution capabilities you need? It can be confusing to determine which features are most important for your project. And sometimes key features are overlooked. Get the Entity Resolution Evaluation Checklist to make sure you’ve thought of everything to make your project a success! The list was created by Senzing’s team of leading entity resolution experts, based on their real-world experience.

Snowflake and S3 Data Lake

Cloudyard

Read Time: 4 minutes, 23 seconds. In this post we discuss how the AWS S3 service and its Snowflake integration can be used as a data lake in current organizations, and how a customer migrated an on-premises EDW to Snowflake to leverage Snowflake's data lake capabilities. We will use the architecture below to demonstrate converting an existing data lake to a Snowflake deployment.

How to build a communication microservice to send text messages using Twilio and Express?

Workfall

Reading Time: 7 minutes. Twilio is all about empowering communication in a convenient and timely manner. In this blog, we will demonstrate how to build a communication microservice to send text messages using Twilio and Express. Let’s get started! Required Installations: Node.js: a JavaScript runtime environment that executes JavaScript code outside the browser.

Top Posts December 5-11: 4 Useful Intermediate SQL Queries for Data Science

KDnuggets

4 Useful Intermediate SQL Queries for Data Science • How to Select Rows and Columns in Pandas Using [ ], .loc, .iloc, .at and .iat • 3 Free Machine Learning Courses for Beginners • 7 Essential Cheat Sheets for Data Engineering • 7 Techniques to Handle Imbalanced Data.

The Top 25 Data Engineering Influencers and Content Creators on LinkedIn

Databand.ai

By Ryan Yackel, 2022-12-13 10:23:19. Interested in data engineering? You’ve come to the right place. Whether you’re a data engineering pro looking to stay up to date on the latest trends or new to the space and want to learn more, following the right leaders and joining the right conversations can make all the difference when it comes to plugging into the data engineering community.

The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Communication

Speaker: David Bard, Principal at VP Product Coaching

In the fast-paced world of digital innovation, success is often accompanied by a multitude of challenges - like the pitfalls lurking at every turn, threatening to derail the most promising projects. But fret not, this webinar is your key to effective product development! Join us for an enlightening session to empower you to lead your team to greater heights.

The Changing Role of Finance

Teradata

Over the years, the finance function has evolved from pure accounting to being a catalyst of change. We are at the table, not just talking about the numbers, but influencing the business strategy.

Finance 52

The Snowflake Data Experience: A Survey of Snowflake Users and How They Optimize Their Data

Acceldata

Clearly, cost is top of mind for most Snowflake data teams. What’s notable about this particular metric is that other top concerns – data quality and performance – are both intrinsically related to cost.

Data 52

How To Collect Data For Customer Sentiment Analysis

KDnuggets

Customer sentiment analysis involves collecting, analyzing, and leveraging data to understand customers' feelings. This article focuses on how to collect data for customer sentiment analysis.

Data 108

Query Acceleration Service in Snowflake

Cloudyard

Read Time: 4 minutes, 28 seconds. In this post we discuss an important Snowflake capability, the Query Acceleration Service (QAS). When a statement is submitted to a warehouse, Snowflake allocates resources for executing the statement. If there aren’t enough resources available, the statement is queued or additional warehouses are started, depending on the warehouse.

Reimagined: Building Products with Generative AI

“Reimagined: Building Products with Generative AI” is an extensive guide for integrating generative AI into product strategy and careers featuring over 150 real-world examples, 30 case studies, and 20+ frameworks, and endorsed by over 20 leading AI and product executives, inventors, entrepreneurs, and researchers.

Streaming in Production: Collected Best Practices

databricks

Releasing any data pipeline or application into a production state requires planning, testing, monitoring, and maintenance. Streaming pipelines are no different in this.

Building Great Data Products Starts with Data Quality and Data Reliability

Acceldata

To effectively productize data, organizations first need to ensure that their data delivery, availability, and quality are reliable. Learn more here about data reliability.

From Data to Verse: KDnuggets and ChatGPT in Conversation

KDnuggets

KDnuggets recently had the opportunity to sit down with ChatGPT, the newly released and acclaimed artificial intelligence from OpenAI. What we found during the course of conversation was both interesting and surprising. Read on to find out what ChatGPT knew about data science and much more.

Introducing the AWS S3 Data Source: Power customer-facing analytics from Parquet files in your S3 bucket | Propel Data Analytics Blog

Propel Data

You can now power customer-facing analytics use cases such as insights dashboards, product usage reporting, or analytics APIs with Parquet f

AWS 52

The Big Payoff of Application Analytics

Outdated or absent analytics won’t cut it in today’s data-driven applications – not for your end users, your development team, or your business. That’s what drove the five companies in this e-book to change their approach to analytics. Download this e-book to learn about the unique problems each company faced and how they achieved huge returns beyond expectation by embedding analytics into applications.