August, 2022

Data Lake / Lakehouse Guide: Powered by Data Lake Table Formats (Delta Lake, Iceberg, Hudi)

Simon Späti

Ever wanted, or been asked, to build an open-source Data Lake for offloading data for analytics? Asked yourself what components and features that would include? Didn’t know the difference between a Data Lakehouse and a Data Warehouse? Or do you just want to govern your hundreds to thousands of files and get more database-like features, but don’t know how?
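
For a taste of the database-like behavior these table formats add on top of plain files, here is a minimal sketch using Delta Lake, one of the three formats named in the title; the path, schema, and the assumption that the delta-spark package is installed are mine, not the article's.

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package/jars are available to this Spark session.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Two toy events; in a real lake this would be a stream or batch of raw files.
df = spark.createDataFrame([(1, "2022-08-01"), (2, "2022-08-02")], ["id", "event_date"])

# Writing through the table format gives ACID commits, schema enforcement, and a
# transaction log instead of a loose pile of Parquet files.
df.write.format("delta").mode("append").save("/tmp/lake/events")

# Database-like feature: time travel back to the first committed version.
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/lake/events").show()
```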

Data Lake 130

ShortCircuitOperator in Apache Airflow: The guide

Marc Lamberti

The ShortCircuitOperator in Apache Airflow is simple but powerful. It allows you to skip downstream tasks based on the result of a condition. There are many reasons why you may want to stop running tasks. Let’s see how to use the ShortCircuitOperator and what you should be aware of. By the way, if you are new to Airflow, check out my courses here; you will get them at a special discount.
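
As a quick taste before the full guide, here is a minimal sketch of how the operator is typically wired up in an Airflow 2.x DAG; the DAG id, schedule, and weekday condition are illustrative assumptions rather than examples taken from the article.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # Airflow 2.3+; use DummyOperator on older 2.x
from airflow.operators.python import ShortCircuitOperator


def _is_weekday(**context):
    # Return True to let downstream tasks run, False to skip them.
    return context["logical_date"].weekday() < 5


with DAG(
    dag_id="short_circuit_example",   # illustrative name
    start_date=datetime(2022, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    check = ShortCircuitOperator(
        task_id="only_on_weekdays",
        python_callable=_is_weekday,
    )
    process = EmptyOperator(task_id="process_data")

    # When _is_weekday returns False, process_data (and anything downstream) is skipped.
    check >> process
```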

Coding 130

How to gather requirements for your data project

Start Data Engineering

Contents: 1. Introduction; 2. Gathering requirements; 2.1. Identify the end-users; 2.2. Help end-users define the requirements; 2.3. End-user validation; 2.4. Deliver iteratively; 2.5. Handling changing requirements/new features; 3. Conclusion; 4. Further reading; 5. Reference. From the introduction: Data engineers are often caught off guard by undefined end-user assumptions.

Project 130

7 Techniques to Handle Imbalanced Data

KDnuggets

This blog post introduces seven techniques commonly applied in domains like intrusion detection and real-time bidding, where datasets are often extremely imbalanced.
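
The post's exact seven techniques aren't reproduced here, but as a hedged illustration of the general idea, this sketch shows two common remedies, cost-sensitive class weights and naive oversampling, on a synthetic 98/2 dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data (roughly 98% negative, 2% positive).
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.98, 0.02], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Remedy 1: cost-sensitive learning via class weights.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# Remedy 2: naive random oversampling of the minority class before fitting.
minority = X_train[y_train == 1]
X_over = np.vstack([X_train, np.repeat(minority, 10, axis=0)])
y_over = np.hstack([y_train, np.ones(len(minority) * 10, dtype=int)])
clf_over = LogisticRegression(max_iter=1000).fit(X_over, y_over)
```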

Datasets 160

Get Better Network Graphs & Save Analysts Time

Many organizations today are unlocking the power of their data by using graph databases to feed downstream analytics, enhance visualizations, and more. Yet, when different graph nodes represent the same entity, graphs get messy. Watch this essential video with Senzing CEO Jeff Jonas on how adding entity resolution to a graph database condenses network graphs to improve analytics and save your analysts time.

An Exploration Of What Data Automation Can Provide To Data Engineers And Ascend's Journey To Make It A Reality

Data Engineering Podcast

Summary The dream of every engineer is to automate all of their tasks. For data engineers, this is a monumental undertaking. Orchestration engines are one step in that direction, but they are not a complete solution. In this episode Sean Knapp shares his views on what constitutes proper automation and the work that he and his team at Ascend are doing to help make it a reality.

Real-Time Wildlife Monitoring with Apache Kafka

Confluent

Confluent Hackathon ’22: Using Apache Kafka, a Raspberry Pi, and a camera, Simon Aubury builds a detection and monitoring system to better understand wildlife population trends over time.
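
As a hedged sketch of what the Kafka side of such a build might look like (the topic name, event fields, and stubbed detection function are my assumptions, not details from Simon Aubury's project):

```python
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})


def detect_animal():
    # Stand-in for on-device inference over a camera frame.
    return {"species": "kookaburra", "confidence": 0.87, "ts": time.time()}


while True:
    event = detect_animal()
    # Publish each detection to a Kafka topic for downstream monitoring/analytics.
    producer.produce("wildlife-detections", value=json.dumps(event).encode("utf-8"))
    producer.poll(0)   # serve delivery callbacks
    time.sleep(5)
```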

Kafka 117

Teradata VantageCloud Lake and ClearScape Analytics: Empowering Enterprise Analytical Innovation

Teradata

Teradata's new offerings, VantageCloud Lake and ClearScape Analytics, make it the complete cloud analytics & data platform, with cloud-native deployment and expanded analytics capabilities.

Cloud 98

Reinforcement Learning for Budget Constrained Recommendations

Netflix Tech

by Ehtsham Elahi, with James McInerney, Nathan Kallus, Dario Garcia Garcia, and Justin Basilico. This writeup is about using reinforcement learning to construct an optimal list of recommendations when the user has a finite time budget to make a decision from the list of recommendations. Working within the time budget introduces an extra resource constraint for the recommender system.

What Does ETL Have to Do with Machine Learning?

KDnuggets

In the process of producing effective machine learning algorithms, ETL sits at the very base: it is the foundation. Let’s go through the steps that show why ETL is important to machine learning.
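
To make the "foundation" point concrete, here is a toy sketch in which extract, transform, and load steps produce the curated table a model trains on; the file names, columns, and label are illustrative assumptions, not taken from the article.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Extract: pull raw records from a source system (a CSV stands in for one here).
raw = pd.read_csv("raw_orders.csv")  # assumed columns: order_value, customer_age, churned

# Transform: clean the data and derive the features the model will learn from.
features = (
    raw.dropna(subset=["order_value", "customer_age", "churned"])
       .assign(log_order_value=lambda df: np.log1p(df["order_value"]))
)

# Load: persist one curated table so training and retraining read the same clean source.
features.to_parquet("curated_orders.parquet", index=False)

# Only on top of that foundation does the ML step happen.
model = RandomForestClassifier().fit(
    features[["log_order_value", "customer_age"]], features["churned"]
)
```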

Alumni Of AirBnB's Early Years Reflect On What They Learned About Building Data Driven Organizations

Data Engineering Podcast

Summary AirBnB pioneered a number of the organizational practices that have become the goal of modern data teams. Out of that culture a number of successful businesses were created to provide the tools and methods to a broader audience. In this episode several alumni of AirBnB’s formative years who have gone on to found their own companies join the show to reflect on their shared successes, missed opportunities, and lessons learned.

Building 100

Understanding User Needs and Satisfying Them

Speaker: Scott Sehlhorst

We know we want to create products which our customers find to be valuable. Whether we label it as customer-centric or product-led depends on how long we've been doing product management. There are three challenges we face when doing this. The obvious challenge is figuring out what our users need; the non-obvious challenges are in creating a shared understanding of those needs and in sensing if what we're doing is meeting those needs.

Getting Started with Stream Processing: The Ultimate Guide

Confluent

Whether you’re new to stream processing or evaluating real-time data use cases, learn how stream processing works, its benefits, and the best way to get started.

Process 120

How Universal Data Distribution Accelerates Complex DoD Missions

Cloudera

We’ve come a long way since 1778 when George Washington’s spies gathered and shared military intelligence on the British Army’s tactical operations in occupied New York. But information broadly, and the management of data specifically, is still “the” critical factor for situational awareness, streamlined operations, and a host of other use cases across today’s tech-driven battlefields.

Reflections on Data Literacy for Financial Services Leaders

Teradata

In conversations with C-level execs at banks & financial institutions, one theme always crops up: how do we change our operating model to be more agile & customer-focused in a digital-first world?

Banking 98

Data Quality Monitoring – You’re Doing It Wrong

Monte Carlo

Occasionally, we’ll talk with data teams interested in applying data quality monitoring narrowly across only a specific set of key tables. The argument goes something like: “You may have hundreds or thousands of tables in your environment, but most of your business value derives from only a few that really matter. That’s where you really want to focus your efforts.”

IT 52

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Speaker: Timothy Chan, PhD., Head of Data Science

Are you ready to move beyond the basics and take a deep dive into the cutting-edge techniques that are reshaping the landscape of experimentation? 🌐 From Sequential Testing to Multi-Armed Bandits, Switchback Experiments to Stratified Sampling, Timothy Chan, Data Science Lead, is here to unravel the mysteries of these powerful methodologies that are revolutionizing how we approach testing.

Data Transformation: Standardization vs Normalization

KDnuggets

Increased model accuracy often comes from the first steps of data transformation. This guide explains the difference between the key feature scaling methods of standardization and normalization, and demonstrates when and how to apply each approach.
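
A compact sketch of the two methods side by side (the toy matrix is made up): standardization rescales each column to zero mean and unit variance, while min-max normalization maps each column onto [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [4.0, 1000.0]])

standardized = StandardScaler().fit_transform(X)  # (x - mean) / std, per column
normalized = MinMaxScaler().fit_transform(X)      # (x - min) / (max - min), per column

print(standardized.round(2))
print(normalized.round(2))
```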

Data 160

An Exploration Of The Expectations, Ecosystem, and Realities Of Real-Time Data Applications

Data Engineering Podcast

Summary Data has permeated every aspect of our lives and the products that we interact with. As a result, end users and customers have come to expect interactions and updates with services and analytics to be fast and up to date. In this episode Shruti Bhat gives her view on the state of the ecosystem for real-time data and the work that she and her team at Rockset are doing to make it easier for engineers to build those experiences.

Data Enrichment in Existing Data Pipelines Using Confluent Cloud

Confluent

Learn how you can integrate data streams into your environment, and enrich data across your existing data pipelines using Confluent Cloud.

How to Use Apache Iceberg in CDP’s Open Lakehouse

Cloudera

In June 2022, Cloudera announced the general availability of Apache Iceberg in the Cloudera Data Platform (CDP). Iceberg is a 100% open table format, developed through the Apache Software Foundation, which helps users avoid vendor lock-in and implement an open lakehouse. The general availability covers Iceberg running within some of the key data services in CDP, including Cloudera Data Warehouse (CDW), Cloudera Data Engineering (CDE), and Cloudera Machine Learning (CML).
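
For readers who want a feel for the syntax, here is a hedged PySpark sketch of creating and querying an Iceberg table; the catalog name, warehouse path, and schema are assumptions for illustration (and assume the iceberg-spark-runtime jar is available), whereas CDP users would rely on the catalogs that CDW, CDE, and CML configure for them.

```python
from pyspark.sql import SparkSession

# Assumes the iceberg-spark-runtime jar is on the classpath.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create, write to, and read an Iceberg table through the "demo" catalog.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp())")
spark.sql("SELECT * FROM demo.db.events").show()
```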

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

Speaker: Aarushi Kansal, AI Leader & Author and Tony Karrer, Founder & CTO at Aggregage

Software leaders who are building applications based on Large Language Models (LLMs) often find it a challenge to achieve reliability. It’s no surprise given the non-deterministic nature of LLMs. To effectively create reliable LLM-based (often with RAG) applications, extensive testing and evaluation processes are crucial. This often ends up involving meticulous adjustments to prompts.

Escaping the Prison of Forecasting

Teradata

Retail and CPG businesses are trapped by the disconnect between today’s digital customers and long-established demand forecasting and supply-chain processes. Find out more.

Retail 97

MarkLogic And Machine Learning: Easy way of ML

Knoldus

Machine learning is a subfield of computer science that deals with the construction of artificial intelligence systems that can learn without being explicitly programmed. It has been applied in many areas such as data analysis, pattern recognition, and understanding human behavior. MarkLogic combines database internals, search-style indexing, and application server behavior into a unified system.

The Importance of Experiment Design in Data Science

KDnuggets

Do you feel overwhelmed by the sheer number of ideas that you could try while building a machine learning pipeline? You cannot take the liberty of trying all possible ways to arrive at a solution, hence we discuss the importance of experiment design in data science projects.

Understanding The Role Of The Chief Data Officer

Data Engineering Podcast

Summary The position of Chief Data Officer (CDO) is relatively new in the business world and has not been universally adopted. As a result, not everyone understands what the responsibilities of the role are, when you need one, and how to hire for it. In this episode Tracy Daniels, CDO of Truist, shares her journey into the position, her responsibilities, and her relationship to the data professionals in her organization.

Metadata 100

The Big Payoff of Application Analytics

Outdated or absent analytics won’t cut it in today’s data-driven applications – not for your end users, your development team, or your business. That’s what drove the five companies in this e-book to change their approach to analytics. Download this e-book to learn about the unique problems each company faced and how they achieved huge returns beyond expectation by embedding analytics into applications.

Serverless Stream Processing with Apache Kafka, Azure Functions, and ksqlDB

Confluent

Confluent’s ksqlDB product offers powerful, serverless stream processing tools that maximize Kafka on Azure.

Kafka 104

The future of data architecture is hybrid: choosing your hybrid-first data strategy starts at Cloudera Now 2022

Cloudera

With all of the buzz around cloud computing, many companies have overlooked the importance of hybrid data. Many large enterprises went all-in on cloud without considering the costs and potential risks associated with a cloud-only approach. The truth is, the future of data architecture is all about hybrid. Hybrid data capabilities enable organizations to collect and store information on premises, in public or private clouds, and at the edge, without sacrificing the important analytics needed to…

An "Everything Data" Approach to Smart Cities

Teradata

Teradata’s approach to the Smart City is analytics-centric and built on a city data ecosystem designed to give access across all relevant data. Find out more.

Data 98

5 Steps to Operationalizing Data Observability with Monte Carlo

Monte Carlo

“How do we scale data observability with Monte Carlo?” I’ve heard this from hundreds of new customers. They’re excited about all that data observability can do for them, but like with any new software, they want prescriptive guidance. “In the ‘Crawl → Walk → Run’ of software adoption, what’s the quickest way for my team to start crawling?” If you’re a data team of 5-15 engineers or analysts, I recommend building healthy data observability muscles using our end-to-end, out-of-the-box monitors…

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Speaker: Anne Steiner and David Laribee

As a concept, Developer Experience (DX) has gained significant attention in the tech industry. It emphasizes engineers’ efficiency and satisfaction during the product development process. As product managers, we need to understand how a good DX can contribute not only to the well-being of our development teams but also to the broader objectives of product success and customer satisfaction.

How Do Data Scientists and Data Engineers Work Together?

KDnuggets

If you’re considering a career in data science, it’s important to understand how these two fields differ, and which one might be more appropriate for someone with your skills and interests.

Collecting And Retaining Contextual Metadata For Powerful And Effective Data Discovery

Data Engineering Podcast

Summary Data is useless if it isn’t being used, and you can’t use it if you don’t know where it is. Data catalogs were the first solution to this problem, but they are only helpful if you know what you are looking for. In this episode Shinji Kim discusses the challenges of data discovery and how to collect and preserve additional context about each piece of information so that you can find what you need when you don’t even know what you’re looking for yet.

Metadata 100

Getting Started with the KRaft Protocol

Confluent

Kafka Raft lets you use Apache Kafka without ZooKeeper by consolidating metadata management. Here’s how you can learn and do more with KRaft.

Kafka 76

Applying Fine Grained Security to Apache Spark

Cloudera

Fine-grained access control (FGAC) with Spark. Apache Spark, with its rich data APIs, has been the processing engine of choice in a wide range of applications from data engineering to machine learning, but its security integration has been a pain point. Many enterprise customers need finer granularity of control, in particular at the column and row level (commonly known as fine-grained access control, or FGAC).
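
Purely to illustrate what column- and row-level control means in practice, here is a view-based approximation in PySpark; this is not the policy-driven FGAC mechanism the Cloudera post describes, and the table, columns, and filter are made-up examples.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A made-up base table with sensitive columns.
spark.createDataFrame(
    [(1, "EMEA", "flu", "123-45-6789"), (2, "APAC", "cold", "987-65-4321")],
    ["patient_id", "region", "diagnosis", "ssn"],
).createOrReplaceTempView("patients")

# Column-level masking (ssn) and row-level filtering (region) expressed as a view;
# centralized FGAC enforces the same kind of rules without hand-built views.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW patients_restricted AS
    SELECT patient_id,
           diagnosis,
           '***' AS ssn
    FROM patients
    WHERE region = 'EMEA'
""")

# Analysts query the restricted view rather than the base table.
spark.sql("SELECT * FROM patients_restricted").show()
```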

The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Communication

Speaker: David Bard, Principal at VP Product Coaching

In the fast-paced world of digital innovation, success is often accompanied by a multitude of challenges - like the pitfalls lurking at every turn, threatening to derail the most promising projects. But fret not, this webinar is your key to effective product development! Join us for an enlightening session to empower you to lead your team to greater heights.