Trending Articles

Airflow Sensors: What you need to know

Marc Lamberti

Airflow Sensors are one of the most common tasks in data pipelines. Why? Because a Sensor waits for a condition to be true before it completes. Do you need to wait for a file? Check if an SQL entry exists? Delay the execution of a DAG? Those are just a few of the possibilities Airflow Sensors offer. If you want to build complex and robust data pipelines, you have to genuinely understand how Sensors work.

Working at a Startup vs in Big Tech

The Pragmatic Engineer

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of four topics in today’s subscriber-only The Pulse issue. To get full newsletters twice a week, subscribe here. Willem Spruijt is a software engineer with whom I worked on the same team at Uber in Amsterdam, building payments systems.

Why are Cloud Development Environments Spiking in Popularity, Now?

The Pragmatic Engineer

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover topics related to Big Tech and startups through the lens of engineering managers and senior engineers. In this article, we cover a fresh industry trend: Cloud Development Environments, an analysis that full subscribers received three weeks ago.

Building ETL Pipelines With Generative AI

Data Engineering Podcast

Summary: Artificial intelligence applications require substantial high-quality data, which is provided through ETL pipelines. Now that AI has reached the level of sophistication seen in the various generative models, it is being used to build new ETL workflows. In this episode, Jay Mishra shares his experiences and insights building ETL pipelines with the help of generative AI.

The Definitive Entity Resolution Buyer’s Guide

Are you thinking of adding enhanced data matching and relationship detection to your product or service? Do you need to know more about what to look for when assessing your options? Our Entity Resolution Buyer’s Guide gives you step-by-step details about everything you should consider when evaluating entity resolution technologies. We discuss use cases, technology, and deployment options, top ten evaluation criteria and more.

DuckDB + Delta Lake (the new lake house?)

Confessions of a Data Guy

I always leave it to my dear readers and followers to give me pokes in the right direction. Nothing like the teeming masses to set you straight. Recently I was working on my Substack Newsletter, on the topic of Polars + Delta Lake, reading remove files from s3 … I left a question open on […] The post DuckDB + Delta Lake (the new lake house?

More Trending

Airflow Variables: The Ultimate Guide

Marc Lamberti

Airflow Variables are easy to use but easy to misuse as well. In this tutorial, you will learn everything you need about variables in Apache Airflow: what they are, how they work, how to define one, how to get its value, and more. If you followed my course “Apache Airflow: The Hands-On Guide”, variables shouldn’t sound unfamiliar. This time, I will give you all I know about variables so that, in the end, you will be ready to use Variables in your DAGs properly.

Arbitrary stateful processing in PySpark with applyInPandasWithState

Waitingforcode

It's always a huge pleasure to see the PySpark API covering more and more Scala API features. Starting from Apache Spark 3.4.0 you can even write arbitrary stateful processing jobs! But since the API is a little bit different than the one available on the Scala side, I wanted to take a deeper look.

Introduction to using Rust Libraries (cargo and crates)

Confessions of a Data Guy

So perhaps you’re thinking it’s time to use Rust on your next project. You’ll find plenty of primers on how to get your feet wet in the language (and if you somehow made it this far without that much, The Book is that starting point), but maybe you’re feeling a bit lost amidst the seas […] The post Introduction to using Rust Libraries (cargo and crates) appeared first on Confessions of a Data Guy.

Getting Started with PyTorch in 5 Steps

KDnuggets

This tutorial provides an in-depth introduction to machine learning using PyTorch and its high-level wrapper, PyTorch Lightning. The article covers essential steps from installation to advanced topics, offering a hands-on approach to building and training neural networks, and emphasizing the benefits of using Lightning.

Deploy Private LLMs using Databricks Model Serving

databricks

We are excited to announce public preview of GPU and LLM optimization support for Databricks Model Serving! With this launch, you can deploy.

Getting started with Airflow in 10 mins

Marc Lamberti

At the end of this introduction to Airflow, you will be all set for getting started with Airflow. You will start with the basics, such as what Airflow is and the essential concepts. Then you will set up and run your local development environment using the Astro CLI to create your first data pipeline. I hope you’re getting excited. Fasten your seatbelt, take a deep breath, and let’s go! For a complete hands-on introduction to Apache Airflow, here is a 6-hour course at a discount.

IBM Technology Chooses Cloudera as its Preferred Partner for Addressing Real Time Data Movement Using Kafka

Cloudera

Organizations increasingly rely on streaming data sources not only to bring data into the enterprise but also to perform streaming analytics that accelerate the process of being able to get value from the data early in its lifecycle. As lakehouse architectures (including offerings from Cloudera and IBM) become the norm for data processing and building AI applications, a robust streaming service becomes a critical building block for modern data architectures.

How Snowflake Native Apps Deliver Security for App Builders and Consumers

Snowflake

The Snowflake Native App Framework, which leverages Snowflake’s advanced architecture, allows for a new level of security for applications. This security spans not just the application consumer, but also the application providers. Controlling all software and infrastructure in the Snowflake Data Cloud, Snowflake can secure the application code to protect the intellectual property (IP) of builders.

A Comparative Overview of the Top 10 Open Source Data Science Tools in 2023

KDnuggets

Are you looking for the open source tools to help you in your data science journey? Look no further. Discover these game-changers that will elevate your data-driven decisions.

Working with Esri Vector Basemaps in ArcGIS Pro

ArcGIS

Esri Vector Basemaps are available for use in ArcGIS Pro, and that opens up some new possibilities for you.

Ballard Power Systems RDU (Remote Diagnostics Unit) Visualization Platform for Interactive At-Scale Industrial IoT Streaming Analytics

databricks

This article represents a collaborative effort between Plotly, Ballard Power Systems, and Databricks. Fleets of buses worldwide run on hydrogen fuel cells made.

Strengthening Your Data Ecosystem with Unrivaled Security

Cloudera

As data ecosystems evolve, security becomes a paramount concern, especially within the realm of private cloud environments. Cloudera on Private Cloud with the Private Cloud Base (CDP PvC Base) stands as a beacon of innovation in the realm of data security, offering a holistic suite of features that work in concert to safeguard sensitive information. With the latest 7.1.9 release, the journey towards a more secure data ecosystem continues, one where businesses can unlock the full potential of th

Lessons from debugging a tricky direct memory leak

Pinterest Engineering

Sanchay Javeria | Software Engineer, Ads Data Infrastructure To support metrics reporting for ads from external advertisers and real-time ad budget calculations at Pinterest, we run streaming pipelines using Apache Flink. These jobs have guaranteed an overall 99th percentile availability to our users; however, every once in a while some tasks get hit with nasty direct out-of-memory (OOM) errors on multiple operators that look something like this: As is the case with most failures in a distribute

Getting Started with Google Cloud Platform in 5 Steps

KDnuggets

Explore the essentials of Google Cloud Platform for data science and ML, from account setup to model deployment, with hands-on project examples.

Robinhood Partners with Operation HOPE in Support of the 1865 Project 

Robinhood

The initiative supports Operation HOPE’s mission to expand economic opportunity. Robinhood Markets, Inc. is deepening our partnership with Operation HOPE through the 1865 Project, a new initiative that supports the organization’s mission to expand economic opportunity in underserved communities through financial education and empowerment. With our support, the 1865 Project will allow Operation HOPE to further grow and scale its work across America and fuel innovative programs and technologies.

easyJet bets on Databricks Lakehouse and Generative AI to be an Innovation Leader in Aviation

databricks

This blog is authored by Ben Dias, Director of Data Science and Analytics and Ioannis Mesionis, Lead Data Scientist at easyJet Introduction to.

How to Stream JSON Data Using Server-Sent Events and FastAPI in Python over HTTP?

Workfall

Reading time: 9 minutes. This blog covers what Server-Sent Events are, why to stream data using Server-Sent Events (SSE), what FastAPI is, a hands-on walkthrough, and a conclusion. Server-Sent Events (SSE) is a simple and efficient technology for sending real-time updates from the server to the web browser over a single HTTP connection.

Training Foundation Improvements for Closeup Recommendation Ranker

Pinterest Engineering

Fan Jiang | Software Engineer, Closeup Candidate Retrieval; Liyao Lu | Software Engineer, Closeup Ranking & Blending; Laksh Bhasin | Software Engineer, Core ML Foundations; Chen Yang | Software Engineer, Core ML Foundations; Shivin Thukral | Software Engineer, Closeup Ranking & Blending; Travis Ebesu | Software Engineer, Closeup Ranking & Blending; Kent Jiang | Software Engineer, Core Serving Infra; Yan Sun | Engineering Manager, Closeup Ranking & Blending; Huizhong Duan | Engine

Top 7 Free Cloud Notebooks for Data Science

KDnuggets

Cloud notebooks are game-changers for data science, providing free access to computing, pre-built environments, collaboration features, and third-party integrations - everything you need to enhance your workflow.

Career stories: The math-music connection in data science

LinkedIn Engineering

When Javier signed up for a programming course during the pandemic, he had no idea that his career was about to shift from the world of music to data science. As his interest in AI and computer science grew, Javier found a community at LinkedIn that supported his growth and provided more opportunities to learn and lead than he could have imagined. Making the leap from music to LinkedIn Engineering with REACH: My journey to LinkedIn and passion for coding came from an entirely different background

Data Access API over Data Lake Tables Without the Complexity

Towards Data Science

Build a robust GraphQL API service on top of your S3 data lake files with DuckDB and Go. Data lake tables are mostly utilized by data engineering teams using big data compute engines, such as Spark or Flink, as well as by data analysts and scientists creating models and reports with heavy SQL query engines, such as Trino or Redshift.

4 Ways Better Access to Healthcare Data Can Improve Patient Outcomes

Snowflake

From improving patient outcomes to increasing clinical efficiencies, better access to data is helping healthcare organizations deliver better patient care. Data from hospitals, pharmacies, clinics, insurers, community and public health organizations, telehealth visits and wellness apps can be combined to provide a comprehensive view of patient health.

Old School: Adapting Esri Basemaps for Printed Products

ArcGIS

Esri basemaps are designed to be used at multiple scales, but a static map needs everything in one view. How do we get around that?

The Quest for Model Confidence: Can You Trust a Black Box?

KDnuggets

This article explores strategies for evaluating the reliability of labels generated by Large Language Models (LLMs). It discusses the effectiveness of different approaches and offers practical insights for various applications.

Pinternship Wrap-Up: Summer 2023

Pinterest Engineering

Each summer, Pinterest welcomes Software Engineering Pinterns who spend 12 weeks with us creating impact within our product and teams. While Pinterns are fully immersed in their teams throughout the summer, they also get to attend exciting activities and events hosted by the University Recruiting team and within the company. Here’s a quick recap from this summer: Social events were a hit with boba tea making, creating your own vision board, chocolate making and a virtual escape room.

Announcing the Public Preview of Lakeview Dashboards!

databricks

We are excited to announce the public preview of the next generation of Databricks SQL dashboards, dubbed Lakeview dashboards. Available today, this new.

Marketing Success in the Age of AI: Celebrating EMEA’s Modern Marketing Data Stack Pioneers

Snowflake

Data is an invaluable asset in today’s marketing ecosystem. With its unique blend of cultures, economies, and regulatory environments, the EMEA market offers a nuanced picture of how marketers harness data technologies to understand their audiences, calibrate campaigns in real time, and adhere to complex government and industry regulations. In our second annual Snowflake Modern Marketing Data Stack 2023 report, we delve into actual usage and adoption of marketing technologies within the Snowfla

Combatting Spammers in Real Time: Unleashing the Power of AI

Confluent

Use Confluent connectors, stream governance and stream processing and artificial intelligence (AI) to block spammers, reduce network load, and improve your customers' experiences in real time.