Sat. Oct 09, 2021 - Fri. Oct 15, 2021

How to add tests to your data pipelines

Start Data Engineering

Contents: Introduction; Testing your data pipeline; 1. End-to-end system testing; 2. Data quality testing; 3. Monitoring and alerting; 4. Unit and contract testing; Conclusion; Further reading. Introduction: Testing data pipelines is different from testing other applications, like a website backend.

Bringing The Power Of The DataHub Real-Time Metadata Graph To Everyone At Acryl Data

Data Engineering Podcast

Summary: The binding element of all data work is the metadata graph that is generated by all of the workflows that produce the assets used by teams across the organization. The DataHub project was created as a way to bring order to the scale of LinkedIn’s data needs. It was also designed to be able to work for small scale systems that are just starting to develop in complexity.

Metadata 100

CDP Public Cloud Regional Control Plane is Now Available in Australia and Europe

Cloudera

We’re excited to announce the availability of CDP Public Cloud Regional Control Plane in Australia and Europe. This addition will extend CDP Hybrid capabilities to customers in industries with strict data protection requirements by allowing them to govern their data entirely in-region. CDP’s public cloud architecture is designed to ensure that customer data remains within a customer’s environment at all times, helping enable companies to meet their data protection obligations, including any rest

Cloud 94

Apache Kafka and R: Real-Time Prediction and Model (Re)training

Confluent

Machine learning on real-time data is a powerful combination because you gain direct insights into your data, can make powerful decisions, and consequently improve your business processes and outcomes. It […].

Kafka 89

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Speaker: Anne Steiner and David Laribee

As a concept, Developer Experience (DX) has gained significant attention in the tech industry. It emphasizes engineers’ efficiency and satisfaction during the product development process. As product managers, we need to understand how a good DX can contribute not only to the well-being of our development teams but also to the broader objectives of product success and customer satisfaction.

10 Skills to Ace Your Data Engineering Interviews

Start Data Engineering

Contents: Introduction; Skills; 1. SQL; 2. Python; 3. Leetcode: data structures and algorithms; 4. Data modeling; 4.1 Data warehousing; 4.2 OLTP; 5. Data pipelines; 6. Distributed system fundamentals; 7. Event streaming; 8. System design; 9. Business questions; 10. Cloud computing; 11. Probabilistic data structures (optional); Interview prep, the TL;DR version; Conclusion. Introduction: Are you a student, analyst, engineer, or someone preparing for a data engineering interview and overwhelmed by all the tools and concepts?

The Data Janitor Letters - September 2021

Pipeline Data Engineering

Data engineering salon. News and interesting reads about the world of data. "Cloudflare’s Disruption" (Ben Thompson, Stratechery): S3’s margin is R2’s opportunity. "Operations is not Developer IT" (Mathew Duggan, DevOps Manager, GAN Integrity): It's not their fault, they were told this was easy. "How Big Tech Runs Tech Projects and the Curious Absence of Scrum" (Gergely Orosz): A survey of how tech projects run across the industry highlights Scrum being absent from Big Tech.

More Trending

Confluent’s Oracle CDC Connector Now Supports Oracle Database 19c

Confluent

Many Oracle Database customers currently still leverage Oracle 12c or 18c in their production environments, with some even using Oracle 11g. Most of these customers have moved to 19c or […].

What's the difference between ETL & ELT?

Start Data Engineering

Contents: 1. Introduction; 2. E-T-L definition; 3. Differences between ETL & ELT; 4. Conclusion; 5. Further reading. 1. Introduction: If you are a student, analyst, engineer, or anyone working with data pipelines, you will have heard of ETL and ELT architecture. If you have questions like "What is the difference between ETL & ELT?" "Should I use ETL or ELT pattern for my data pipeline?"

What Is a Cloud Database? IaaS, PaaS, SaaS and DBaaS Explained

Rockset

For many organizations, the advantages of a cloud-based database are clear. They offer scalability, security, and availability. There can also be cost savings over custom and on-premises database solutions. However, not all cloud databases are created equal. Terms like IaaS, PaaS and SaaS have traditionally been used to describe various levels of cloud computing, but how do they apply to cloud databases?

#ClouderaLife Spotlight: Bryan Bottinelli, Commercial Account Executive

Cloudera

As we continue to celebrate Hispanic Heritage Month, we’d like to shine a spotlight on yet another one of Cloudera’s high-performing employees who contributes to the culture and community both inside and outside the Cloudera walls. Meet Bryan Bottinelli, a 2-year Clouderan and first-generation American with roots in Colombia and Chile. As a Commercial Account Manager, he spends his work days growing the adoption of Cloudera Data Platform (CDP) in the Great Lakes region.

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

Speaker: Aarushi Kansal, AI Leader & Author and Tony Karrer, Founder & CTO at Aggregage

Software leaders who are building applications based on Large Language Models (LLMs) often find it a challenge to achieve reliability. It’s no surprise given the non-deterministic nature of LLMs. To effectively create reliable LLM-based (often with RAG) applications, extensive testing and evaluation processes are crucial. This often ends up involving meticulous adjustments to prompts.

15 Machine Learning Regression Projects Ideas for Beginners

ProjectPro

Linear and logistic regression models in machine learning mark most beginners’ first steps into the world of machine learning. Whether you want to understand the effect of IQ and education on earnings or analyze how smoking cigarettes and drinking coffee are related to mortality, all you need is to understand the concepts of linear and logistic regression.

What are Common Table Expressions (CTEs) and when to use them?

Start Data Engineering

Contents: Introduction; Setup; Common Table Expressions (CTEs); Performance comparison; CTE; Subquery and derived tables; Temp table; Trade-offs; Tear down; Conclusion; References. Introduction: If you are a student, analyst, engineer, or anyone in the data space and are wondering what CTEs are or trying to understand CTE performance, then this post is for you. In this post, we go over what CTEs are and compare their performance to the subquery, derived table, and temp table.

Firm Foundations Needed for 5G Exploration

Teradata

Telcos, their customers, & a range of enterprises are entering a period of experimentation with 5G. The opportunities for innovation & growth are immense – but the costs & risks are outsized too.

52

Your Parents Still Don’t Know What a Hashtag Is. Let’s Teach Them the Basics of Machine Learning and Streaming Data

Cloudera

Quite often, the digital natives of the family (you) have to explain to the analog fans of the family what PDFs are, how to use a hashtag, a phone camera, or a remote. Imagine if you had to explain what machine learning is and how to use it. There’s no need to panic. Cloudera produced a series of ebooks (Production Machine Learning For Dummies, Apache NiFi For Dummies, and Apache Flink For Dummies, coming soon) to help simplify even the most complex tech topics.

How to Build an Experimentation Culture for Data-Driven Product Development

Speaker: Margaret-Ann Seger, Head of Product, Statsig

Experimentation is often seen as an aspirational practice, especially at smaller, fast-moving companies who are strapped for time and resources. So, how can you get your team making decisions in a more data-driven way while continuing to remain lean and maintaining ship velocity? In this webinar, Margaret-Ann Seger, Head of Product at Statsig, will teach you how to build an experimentation culture from the ground-up, graduating from just getting started with data-driven development to operating

Building a Platform for Content Curation

Afterpay Tech

By Tony Tamplin. After years of growth and development on evolving products, Afterpay decided it was time to apply the knowledge accumulated and create a consistent and focused direction across all products. One part of that plan involved rebuilding the website from the ground up to provide features and performance that Afterpay’s users deserve.

6 Key Concepts, to Master Window Functions

Start Data Engineering

Contents: Introduction; Prerequisites; 6 Key Concepts; 1. When to Use; 2. Partition By; 3. Order By; 4. Function; 5. Lead and Lag; 6. Rolling Window; Efficiency Considerations; Conclusion; Further reading; References. Introduction: If you work with data, window functions can significantly level up your SQL skills.

SQL 130

Announcing O’Reilly’s Data Quality Fundamentals

Monte Carlo

On behalf of the entire company, I’m excited to announce the release of Data Quality Fundamentals: A Practitioner’s Guide to Building More Trustworthy Data Pipelines, published by O’Reilly Media and available for free on the Monte Carlo website. This is the first book published by O’Reilly to educate the market on how best-in-class data teams design and architect technical systems to achieve trustworthy and reliable data at scale.

Databricks Execution Plans

Advancing Analytics: Data Engineering

Execution plans in Databricks let you understand how code will actually be executed across a cluster and are useful for optimising queries. Databricks translates operations into optimized logical and physical plans and shows which operations are going to be executed and sent to the Spark executors. Execution flow: Databricks uses the Catalyst optimizer, which automatically discovers the most efficient plan to execute the operations specified.

Entity Resolution Checklist: What to Consider When Evaluating Options

Are you trying to decide which entity resolution capabilities you need? It can be confusing to determine which features are most important for your project. And sometimes key features are overlooked. Get the Entity Resolution Evaluation Checklist to make sure you’ve thought of everything to make your project a success! The list was created by Senzing’s team of leading entity resolution experts, based on their real-world experience.

Automate Reporting to Drive Value

Teradata

Learn why automating regulatory reporting for value is a requirement and an opportunity that today's banks must embrace.

Banking 52

6 Responsibilities of a Data Engineer

Start Data Engineering

Contents: Introduction; Responsibilities of a data engineer; 1. Move data between systems; 2. Manage data warehouse; 3. Schedule, execute, and monitor data pipelines; 4. Serve data to the end-users; 5. Data strategy for the company; 6. Deploy ML models to production; Conclusion; Further reading. Introduction: Data engineering is a relatively new field, and as such, there is a huge variance in the actual job responsibilities across different companies.

Distributing Alert Messages From Grafana With Webhooks

RudderStack

How to create alerts in Grafana and send them to a RudderStack Webhook for delivery to tools like Slack, Microsoft Teams, Email, PagerDuty, and Datadog.

Data 40

How And Why To Become Data Driven As A Business

Data Engineering Podcast

Summary: Organizations of all sizes are striving to become data driven, starting in earnest with the rise of big data a decade ago. With the never-ending growth in data sources and methods for aggregating and analyzing them, the use of data to direct the business has become a requirement. Randy Bean has been helping enterprise organizations define and execute their data strategies since before the age of big data.

The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Communication

Speaker: David Bard, Principal at VP Product Coaching

In the fast-paced world of digital innovation, success is often accompanied by a multitude of challenges - like the pitfalls lurking at every turn, threatening to derail the most promising projects. But fret not, this webinar is your key to effective product development! Join us for an enlightening session to empower you to lead your team to greater heights.

Revisiting BetterTLS: Certificate Path Building

Netflix Tech

By Ian Haken. Last year the AddTrust root certificate expired and lots of clients had a bad time. Some Roku devices weren’t working right, Heroku had problems, and some folks couldn’t even curl. In the aftermath Ryan Sleevi wrote a really great blog post not just about the issue of this one certificate’s expiry, but the problem that so many TLS implementations have in general with certificate path building.

What is new in Cloudera Streaming Analytics 1.5?

Cloudera

At the end of May, we released the second version of Cloudera SQL Stream Builder (SSB) as part of Cloudera Streaming Analytics (CSA). Among other features, the 1.4 version of CSA surfaced the expressivity of Flink SQL in SQL Stream Builder via adding DDL and Catalog support, and it greatly improved the integration with other Cloudera Data Platform components, for example via enabling stream enrichment from Hive and Kudu.

Java 117

Operational data lineage with dbt

Datakin

dbt is an amazing way to transform data within a data warehouse. So amazing, in fact, that it’s easy to end up doing tons and tons of transformations on all kinds of datasets. After a while, it can become an unnavigable collection of overlapping tables. That’s a problem when it comes time to troubleshoot. If you use Datakin to observe your dbt models as they run, you can always know exactly where your datasets came from and how they were created.

10 Machine Learning Classification Project Ideas for Beginners

ProjectPro

Classification is a supervised machine learning problem requiring the model to label or assign a class (from a fixed number of classes) to an example. The familiar problems of classifying email as spam or not spam, predicting the handwritten character, and so on are all examples of machine learning projects on classification. Classification problems can be broadly classified as binary and multi-class classification, which involve classification into two and more than two classes, respectively,

Reimagined: Building Products with Generative AI

“Reimagined: Building Products with Generative AI” is an extensive guide for integrating generative AI into product strategy and careers featuring over 150 real-world examples, 30 case studies, and 20+ frameworks, and endorsed by over 20 leading AI and product executives, inventors, entrepreneurs, and researchers.

Tuning Image Classifiers using Human-In-The-Loop

Zalando Engineering

In this blog post we describe an algorithm we developed when building our product image analysis infrastructure, where we use human-in-the-loop to tune the thresholds of our image classifiers. We discuss the algorithm in the following, and present some mathematical details and a simple code example in the appendices. Background When a customer browses for a product on the Zalando website they may use descriptive terms to search for what they want, for example a customer may use a specific term s

CAMBI, a banding artifact detector

Netflix Tech

By Joel Sole, Mariana Afonso, Lukas Krasula, Zhi Li, and Pulkit Tandon. Introducing the banding artifacts detector developed by Netflix, aiming at further improving the delivered video quality. Banding artifacts can be pretty annoying. But, first of all, you may wonder, what is a banding artifact? Banding artifact? You are at home enjoying a show on your brand-new TV.

A Day in the Life of a DataOps Engineer

DataKitchen

A DataOps implementation project consists of three steps. First, you must understand the existing challenges of the data team, including the data architecture and end-to-end toolchain. Second, you must establish a definition of “done.” In DataOps, the definition of done includes more than just some working code. It considers whether a component is deployable, monitorable, maintainable, reusable, secure and adds value to the end-user or customer.

October 2021 dbt Update: Metrics and Hat Tricks

dbt Developer Hub

Hello there, While I have a lot of fun things to share this month, I can't start with anything other than this: Yep, it's official: dbt will support metric definitions. With this feature, you'll be able to centrally define rules for aggregating metrics (think, "active users" or "MRR") in version controlled, tested, documented dbt project code. We still have a ways to go, but in future, you'll be able to explore these metrics in the BI and analytics tools that you know and love.

The Big Payoff of Application Analytics

Outdated or absent analytics won’t cut it in today’s data-driven applications – not for your end users, your development team, or your business. That’s what drove the five companies in this e-book to change their approach to analytics. Download this e-book to learn about the unique problems each company faced and how they achieved huge returns beyond expectation by embedding analytics into applications.