March, 2021

Building a Data Engineering Project in 20 Minutes

Simon Späti

This post focuses on practical data pipelines, with examples covering web scraping real estate listings, uploading them to S3 with MinIO, Spark and Delta Lake, adding some data science magic with Jupyter Notebooks, ingesting into the data warehouse Apache Druid, visualising dashboards with Superset, and managing everything with Dagster. The goal is to touch on common data engineering challenges and to use promising new technologies, tools, or frameworks, most of which I wrote about in Business Intelligence

Toward a Data Mesh (part 2) : Architecture & Technologies

François Nguyen

Just an illustration: not the truth, and you can certainly do it with other technologies. TL;DR: After setting up and organizing the teams, we describe 4 topics to make data mesh a reality: (1) the self-serve platform based on a serverless philosophy (life is too short to do provisioning); (2) the building of data products (as code): we are building data workflows, not data pipelines; (3) the promotion of data domains where the metadata on the data life cycle is as important as your data. The old dat

Apache Kafka Made Simple: A First Glimpse of a Kafka Without ZooKeeper

Confluent

At the heart of Apache Kafka® sits the log—a simple data structure that uses sequential operations that work symbiotically with the underlying hardware. Efficient disk buffering and CPU cache usage, […].

Kafka 145

Data Quality Management For The Whole Team With Soda Data

Data Engineering Podcast

Summary: Data quality is top of mind for everyone lately, but getting it right is as challenging as ever. One of the contributing factors is the number of people who are involved in the process and the potential impact on the business if something goes wrong. In this episode Maarten Masschelein and Tom Baeyens share the work they are doing at Soda to bring everyone on board to make your data clean and reliable.

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Speaker: Timothy Chan, PhD., Head of Data Science

Are you ready to move beyond the basics and take a deep dive into the cutting-edge techniques that are reshaping the landscape of experimentation? 🌐 From Sequential Testing to Multi-Armed Bandits, Switchback Experiments to Stratified Sampling, Timothy Chan, Data Science Lead, is here to unravel the mysteries of these powerful methodologies that are revolutionizing how we approach testing.

How to trigger a spark job from AWS Lambda

Start Data Engineering

Contents: Event driven pipelines; Lambda function to trigger spark jobs; Setup and run; Monitoring and logging; Teardown; Conclusion; Further reading; References. Event driven pipelines: Event driven systems represent a software design pattern where logic is executed in response to an event. This event can be a file creation on S3, a new database row, an API call, etc.

AWS 100

CFO Analytics: What Is It and Why Should You Care?

Teradata

Finance-driven analytics might be the largest untapped opportunity for organizations & a catalyst for driving business value & strategic vision. But, what exactly is CFO analytics?

IT 119

More Trending

International Women’s Day 2021: Challenging what’s possible

Cloudera

This year’s International Women’s Day (IWD) on March 8th comes at a time when global communities, businesses, and governments find themselves continuing to pirouette, pivot, and adapt in the face of a relentless, global pandemic. COVID-19 has touched every aspect of our lives. As women, overnight we suddenly found that we had a portfolio career, comprising our day jobs, caregiver, school teacher and house cleaner, that we had neither asked for, nor were consulted on.

Portfolio 108

Under the Hood of Real-Time Analytics with Apache Kafka and Pinot

Confluent

Real-time analytics has become the need of the hour for modern internet companies. The ability to derive internal insights around business metrics, user growth and adoption as well as security […].

Kafka 144

Real World Change Data Capture At Datacoral

Data Engineering Podcast

Summary: The world of business is becoming increasingly dependent on information that is accurate up to the minute. For analytical systems, the only way to provide this reliably is by implementing change data capture (CDC). Unfortunately, this is a non-trivial undertaking, particularly for teams that don’t have extensive experience working with streaming data and complex distributed systems.

ConsoleMe: A Central Control Plane for AWS Permissions and Access

Netflix Tech

By Curtis Castrapel, Patrick Sanders, and Hee Won Kim. At AWS re:Invent 2020, we open sourced two new tools for managing multi-account AWS permissions and access. We’re very excited to bring you ConsoleMe (pronounced: kuhn-soul-mee), and its CLI utility, Weep (pun intended)! If you missed the talk, check it out here.

AWS 100

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Speaker: Anne Steiner and David Laribee

As a concept, Developer Experience (DX) has gained significant attention in the tech industry. It emphasizes engineers’ efficiency and satisfaction during the product development process. As product managers, we need to understand how a good DX can contribute not only to the well-being of our development teams but also to the broader objectives of product success and customer satisfaction.

How to Host a Virtual Global Data Science Hackathon

Teradata

Learn how best to host a virtual hackathon, or any virtual event, with these tips and tricks from our Teradata team. Read more.

Reverse ETL with dbt and Grouparoo

Grouparoo

Teams are centralizing their data in their data warehouse by loading data in and transforming it as necessary. Increasingly, we are seeing teams turn to dbt to do this transforming. The idea is to write *.sql files that, when run in the right order, create useful rollup tables or materialized views of the data. We've been asked by teams using dbt how Grouparoo can then sync their data to their cloud-based apps.

Congratulations to our 2021 Partner Award Winners

Cloudera

At our Partner Sales Kickoff, we announced the winners of the 2021 Cloudera Partner Awards. These six awards recognize Cloudera partners who are dedicated to enabling customers to do more with their data by leveraging the power of an enterprise data cloud. Thank you to this year’s winners for their partnership in helping our joint customers drive value from their data in the hybrid cloud.

Monitoring Your Event Streams: Integrating Confluent with Prometheus and Grafana

Confluent

Self-managing a highly scalable distributed system with Apache Kafka® at its core is not an easy feat. That’s why operators prefer tooling such as Confluent Control Center for administering and […].

Kafka 130

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

Speaker: Aarushi Kansal, AI Leader & Author and Tony Karrer, Founder & CTO at Aggregage

Software leaders who are building applications based on Large Language Models (LLMs) often find it a challenge to achieve reliability. It’s no surprise given the non-deterministic nature of LLMs. To effectively create reliable LLM-based (often with RAG) applications, extensive testing and evaluation processes are crucial. This often ends up involving meticulous adjustments to prompts.

Leave Your Data Where It Is And Automate Feature Extraction With Molecula

Data Engineering Podcast

Summary: A majority of the time spent in data engineering is copying data between systems to make the information available for different purposes. This introduces challenges such as keeping information synchronized, managing schema evolution, and building transformations to match the expectations of the destination systems. H.O. Maycotte was faced with these same challenges but at a massive scale, leading him to question whether there is a better way.

IT 100

A Day in the Life of an Experimentation and Causal Inference Scientist @ Netflix

Netflix Tech

Stephanie Lane, Wenjing Zheng, Mihir Tendulkar. Source credit: Netflix. Within the rapid expansion of data-related roles in the last decade, the title Data Scientist has emerged as an umbrella term for myriad skills and areas of business focus. What does this title mean within a given company, or even within a given industry? It can be hard to know from the outside.

Enterprise Data Operating Systems in the Cloud: Necessary, But Not Sufficient

Teradata

Getting your Cloud data architecture right starts with understanding which data products you need, the roles they perform, & the functional & non-functional characteristics that those roles demand.

Cloud 110

Community, Metadata Management, and More: Top 10 Links From Across the Web

Data Council

Here's our March 2021 roundup of links from across the web that we selected for you: 1. How to Build a Community (Fishtown Analytics): Claire Carroll's first personal blog post on community-building is a must-read. As Fishtown Analytics' community manager for the last 2.5 years, she's arguably behind the success of the dbt community and its best-in-class practices, so we expected good advice… but she really hit the ball out of the park with this one!

Entity Resolution Checklist: What to Consider When Evaluating Options

Are you trying to decide which entity resolution capabilities you need? It can be confusing to determine which features are most important for your project. And sometimes key features are overlooked. Get the Entity Resolution Evaluation Checklist to make sure you’ve thought of everything to make your project a success! The list was created by Senzing’s team of leading entity resolution experts, based on their real-world experience.

CDP Endpoint Gateway provides Secure Access to CDP Public Cloud Services running in private networks

Cloudera

Cloudera Data Platform (CDP) Public Cloud allows users to deploy analytic workloads into their cloud accounts. These workloads cover the entire data lifecycle and are managed from a central multi-cloud Cloudera Control Plane. CDP provides the flexibility to deploy these resources into public or private subnets. Nearly unanimously, we’ve seen customers deploy their workloads to private subnets.

How to Tune RocksDB for Your Kafka Streams Application

Confluent

Apache Kafka ships with Kafka Streams, a powerful yet lightweight client library for Java and Scala to implement highly scalable and elastic applications and microservices that process and analyze data […].

Kafka 130

Promisifying Your Node Callback Functions

Grouparoo

The Grouparoo application is written in JavaScript (Node). It uses the modern promise-based pattern (async/await) for reading and writing data asynchronously. And we do this a lot: we are a data sync tool! Every once in a while we'll come across a JavaScript library that is written around the old callback-based pattern, where the error object is the first parameter in the callback function, followed by the result.

Scaling Revenue & Growth Tooling

Netflix Tech

Written by Nick Tomlin, Michael Possumato, and Rahul Pilani. This post shares how the Revenue & Growth Tools (RGT) team approaches creating full-stack tools for the teams that are the financial backbone of Netflix. Our primary partners are the teams of Revenue and Growth Engineering (RGE): Growth, Membership, Billing, Payments, and Partner Subscription.

The Big Payoff of Application Analytics

Outdated or absent analytics won’t cut it in today’s data-driven applications – not for your end users, your development team, or your business. That’s what drove the five companies in this e-book to change their approach to analytics. Download this e-book to learn about the unique problems each company faced and how they achieved huge returns beyond expectation by embedding analytics into applications.

Enhancing Customer Experience with Every Journey

Teradata

Big Tech giants dominate by using data to improve product & experience. The auto industry can emulate this by analyzing data to improve customer experience & guide individual choices.

Data 95

Building the Future of Payments With RippleNet’s VP of Engineering

Ripple Engineering

Amidst the work-from-home environment, Vidya Mani joined Ripple in early 2020 as the Vice President of Engineering for RippleNet. A year into her role, she focuses on improving Ripple’s infrastructure and strengthening her team to further the company’s vision for a more inclusive financial system. RippleNet is an enterprise solution which helps banks and other financial institutions streamline global payments and reach new customers.

Data governance beyond SDX: Adding third party assets to Apache Atlas

Cloudera

Governance and the sustainable handling of data are critical success factors in virtually all organizations. While Cloudera Data Platform (CDP) already supports the entire data lifecycle from ‘Edge to AI’, we at Cloudera are fully aware that enterprises have more systems outside of CDP. It is crucial that CDP does not become the next silo in your IT landscape.

To Pull or to Push Your Data with Kafka Connect? That Is the Question.

Confluent

Today, every company is a data company. There are many different data pipeline, integration, and ingestion tools in the market, but before you can feed your data analytics needs, data […].

Kafka 125

The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Communication

Speaker: David Bard, Principal at VP Product Coaching

In the fast-paced world of digital innovation, success is often accompanied by a multitude of challenges - like the pitfalls lurking at every turn, threatening to derail the most promising projects. But fret not, this webinar is your key to effective product development! Join us for an enlightening session to empower you to lead your team to greater heights.

Data-driven performance improvements: Distance running and data

Retail Insight

As a former distance runner, I have seen first-hand how investment in elite sport is accelerating athletic performance. Just as the world has developed, our physical capabilities have too. New lifestyles, technologies, science, and world-class facilities all help to enable athletes to go ‘Faster – Higher – Stronger’, as the Olympics motto states.

Data 52

Dogfooding your product

Grouparoo

“Eating your own dogfood” or “dogfooding” is a term that always felt a bit odd to me, but the principles underlying it are incredibly important to product teams small and large. In short, dogfooding means using your own product in order to better empathize with your users. When you build more empathy for your users, you build a better product. I’ll be sharing some thoughts on why dogfooding is important and some pointers on how to dogfood well.

Don’t Just Collect Vehicle Data – Monetize It!

Teradata

As the auto sector transforms, vehicle data is becoming one of the most important sources of insight. But if it is left in fragmented silos, it quickly becomes a cost & delivers little value.

IT 64

Deep Learning vs Machine Learning - What's the Difference?

ProjectPro

“Machine Learning” and “Deep Learning” are two of the most often confused and conflated terms in the AI world, and they are frequently used interchangeably. However, one fact is undeniable: both machine learning and deep learning are undergoing skyrocketing growth. According to Forbes, the global machine learning market will be worth $30.6 billion by 2024 and the deep learning market size is expected to reach $10.2 billion by 2025, expanding at a CAGR of 42.8% and 52.1

Driving Business Impact for PMs

Speaker: Jon Harmer, Product Manager for Google Cloud

Move from feature factory to customer outcomes and drive impact in your business! This session will provide you with a comprehensive set of tools to help you develop impactful products by shifting from output-based thinking to outcome-based thinking. You will deepen your understanding of your customers and their needs, and learn to identify and de-risk the different kinds of hypotheses built into your roadmap.