November, 2018

article thumbnail

Open-Source Data Warehousing – Druid, Apache Airflow & Superset

Simon Späti

These days, everyone talks about open-source. However, this is still not common in the Data Warehouse (DWH) field. Why is this? In my recent blog, I researched OLAP technologies, for this post I chose some open-source technologies and used them together to build a full data architecture for a Data Warehouse system. I went with Apache Druid for data storage, Apache Superset for querying and Apache Airflow as a task orchestrator.

article thumbnail

Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57

Data Engineering Podcast

Summary Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the lanscape of stream processing tools, and how you can start us

Process 100
Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

article thumbnail

Observability at Scale: Building Uber’s Alerting Ecosystem

Uber Engineering

Uber’s software architectures consists of thousands of microservices that empower teams to iterate quickly and support our company’s global growth. These microservices support a variety of solutions, such as mobile applications, internal and infrastructure services, and products along with complex … The post Observability at Scale: Building Uber’s Alerting Ecosystem appeared first on Uber Engineering Blog.

Building 111
article thumbnail

Netflix Information Security: Preventing Credential Compromise in AWS

Netflix Tech

by Will Bengtson Previously we wrote about a method for detecting credential compromise in your AWS environment. The methodology focused on a continuous learning model and first use principle. This solution still is reactive in nature?—?we only detect credential compromise after it has already happened. Even with detection capabilities, there is a risk that exposed credentials can provide access to sensitive data and/or the ability to cause damage in our environment.

AWS 95
article thumbnail

Get Better Network Graphs & Save Analysts Time

Many organizations today are unlocking the power of their data by using graph databases to feed downstream analytics, enahance visualizations, and more. Yet, when different graph nodes represent the same entity, graphs get messy. Watch this essential video with Senzing CEO Jeff Jonas on how adding entity resolution to a graph database condenses network graphs to improve analytics and save your analysts time.

article thumbnail

Collaboration Between Data Science and Data Engineering: True or False?

Domino Data Lab: Data Engineering

This blog post includes candid insights about addressing tension points that arise when people collaborate on developing and deploying models. Domino’s Head of Content sat down with Don Miner and Marshall Presser to discuss the state of collaboration between data science and data engineering. The blog post provides distilled insights, audio clips, excerpted quotes as well as the full audio and written transcript.

article thumbnail

Five strategies for skills-based volunteering: Lessons learned from Cloudera Cares first-ever Global Day of Service

Cloudera

Corporate volunteering is on the rise. However, only half of companies encourage their employees to participate in skills-based volunteering – defined as employees applying their abilities and specialized talents to challenges facing their communities. As the Program Manager for Cloudera Cares, Cloudera’s employee giving and volunteering program at the Cloudera Foundation, I believe that we can have more impact if we offer employees opportunities for skills-based volunteering.

Food 44

More Trending

article thumbnail

How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56

Data Engineering Podcast

Summary A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.

Data Lake 100
article thumbnail

Zalando Research Releases “Flair”

Zalando Engineering

Open sourcing machine learning research for natural language processing (NLP) Two years ago, Zalando Research launched with a clear purpose to ensure that Zalando Tech is at the forefront of research in the areas of data science, machine learning, natural language processing and artificial intelligence. Our researchers’ work previously focused mainly within Zalando.

article thumbnail

Rockset's RocksDB-Cloud Library - Enabling the Next Generation of Cloud Native Databases

Rockset

Rockset and I began collaborating in 2016 due to my interest in their RocksDB-Cloud open-source key-value store. This post is primarily about the RocksDB-Cloud software, which Rockset open-sourced in 2016, rather than Rockset's newly launched cloud service. In it, I will explore how RocksDB-Cloud can be used to build an open-source cloud-friendly storage system.

article thumbnail

Delivering Meaning with Previews on Web

Netflix Tech

By Corey Grunewald and Tony Casparro As the Netflix catalog of films and series continues to grow, it becomes more challenging to present members with enough information to decide what to watch. How can a member tell if a movie is both a horror and a comedy? The synopsis and artwork help provide some context, but how can we leverage video previews (trailers) to help members find something great to watch?

article thumbnail

Understanding User Needs and Satisfying Them

Speaker: Scott Sehlhorst

We know we want to create products which our customers find to be valuable. Whether we label it as customer-centric or product-led depends on how long we've been doing product management. There are three challenges we face when doing this. The obvious challenge is figuring out what our users need; the non-obvious challenges are in creating a shared understanding of those needs and in sensing if what we're doing is meeting those needs.

article thumbnail

Cloudera Named a Fastest Growing Company by Deloitte for Fourth Year

Cloudera

For the fourth time in the past five years, Cloudera has been named to Deloitte’s Technology Fast 500 as one of the fastest growing companies in North America. This annual ranking showcases the growth of companies in the technology, media, telecommunications, life sciences, and energy tech sectors. This year’s list demonstrated the power of combining breakthrough research and development, entrepreneurship and rapid growth, with software companies like Cloudera making up nearly two-thirds of the

article thumbnail

Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55

Data Engineering Podcast

Summary Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page.

article thumbnail

Digital Transformation Focused on Sustainability

Cloudera

My inspiration for writing this blog was a recent trip to a warehouse and distribution center of a well-known U.S. fast-food enterprise with a reputation for superior quality. During my visit, I had the opportunity to chat with the center’s Manager for Food Safety whose credentials (Ph.D. in Food Science), knowledge, and experience reflect the company’s commitment to product safety and quality.

Food 40
article thumbnail

Train Deep Learning Models on AWS

Zalando Engineering

A real-life example of how to train a Deep Learning model on an AWS Spot Instance using Spotty Spotty is a tool that simplifies training of Deep Learning models on AWS. Why will you ❤️this tool? it makes training on AWS GPU instances as simple as a training on your local computer it automatically manages all necessary AWS resources including AMIs, volumes and snapshots it makes your model trainable on AWS by everyone with a couple of commands it detaches remote processes from SSH sessions it sav

article thumbnail

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Speaker: Timothy Chan, PhD., Head of Data Science

Are you ready to move beyond the basics and take a deep dive into the cutting-edge techniques that are reshaping the landscape of experimentation? 🌐 From Sequential Testing to Multi-Armed Bandits, Switchback Experiments to Stratified Sampling, Timothy Chan, Data Science Lead, is here to unravel the mysteries of these powerful methodologies that are revolutionizing how we approach testing.

article thumbnail

Connexion 2.0 Release

Zalando Engineering

Today, we released Connexion 2.0 with OpenAPI 3 support. Connexion is a Python framework that automagically handles HTTP requests based on OpenAPI Specification (formerly known as Swagger Spec) of your API described in YAML format. Connexion allows you to write a Swagger specification, then maps the endpoints to your Python functions. Besides routing, Connexion also validates requests and responses automatically based on OpenAPI specifications, handles common authentication schemes, supports API

Python 40
article thumbnail

Set Up Your Own Data-as-a-Service Platform On Dremio with Tomer Shiran - Episode 58

Data Engineering Podcast

Summary When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions of it. The default way to manage this situation is by crafting pipelines that will extract the data from source systems and load it into a data lake or data warehouse. In order to make this situation more manageable and allow everyone in the business to gain value from the data the folks at Dremio built a self service data platform.

Data Lake 100
article thumbnail

Netflix at AWS re:Invent 2018

Netflix Tech

by Shaun Blackburn AWS re:Invent is back in Las Vegas this week! Many Netflix engineers and leaders will be among the 40,000 attending the conference to connect with fellow cloud and OSS enthusiasts. You can find us at our booth on the expo floor, speaking on a variety of subjects, and at meetups and events around the re:Invent campus. We have listed all our talks below to make it easy to hear what we have been up to.

AWS 45
article thumbnail

Dynamic Typing in SQL

Rockset

As Peter Bailis put it in his post , querying unstructured data using SQL is a painful process. Moreover, developers frequently prefer dynamic programming languages, so interacting with the strict type system of SQL is a barrier. We at Rockset have built the first schemaless SQL data platform. In this post and a few others that follow, we'd like to introduce you to our approach.

SQL 40
article thumbnail

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

Speaker: Aarushi Kansal, AI Leader & Author and Tony Karrer, Founder & CTO at Aggregage

Software leaders who are building applications based on Large Language Models (LLMs) often find it a challenge to achieve reliability. It’s no surprise given the non-deterministic nature of LLMs. To effectively create reliable LLM-based (often with RAG) applications, extensive testing and evaluation processes are crucial. This often ends up involving meticulous adjustments to prompts.

article thumbnail

Open Source: October Review - Hacktoberfest, new releases and more.

Zalando Engineering

Project Highlights Connexion version 2.0 with OpenAPI 3 support is ready, check out what is new in our latest release! Connexion is the Swagger/OpenAPI first framework for Python on top of Flask with automatic endpoint validation & OAuth2 support. With 87 active contributors and more than 1,000 repositories that depend on Connexion worldwide makes this project one of the most successful open source releases of Zalando.

article thumbnail

An introduction to Federated Learning

Cloudera

We’re excited to release Federated Learning , the latest report and prototype from Cloudera Fast Forward Labs. Federated learning makes it possible to build machine learning systems without direct access to training data. The data remains in its original location, which helps to ensure privacy and reduces communication costs. This article is about the business case for federated learning.

article thumbnail

Tag-based Navigation of a Fashion Catalog

Zalando Engineering

Exploring the Zalando Assortment by Browsing a Product Similarity Graph Introduction As Europe's leading online fashion and lifestyle platform, Zalando is continually developing new features to enable our customers to find the products they want. While the standard tools of Search, Categorization & Attribute Filtering are par-for-the-course for purchasing items online, with an ever-expanding fashion assortment and an increase in the data available to describe a product, this browsing experie

article thumbnail

Zalando Postgres Operator: One Year Later

Zalando Engineering

Zalando Postgres operator: one year later The Postgres operator provides a managed Postgres service for Kubernetes. It extends the Kubernetes API with a custom “postgresql” resource that describes desired characteristics of a Postgres cluster, monitors updates of this resource and adjusts Postgres clusters accordingly. Zalando successfully uses the operator to manage more than 450 Postgres clusters across a large number of Kubernetes installations.

article thumbnail

The Big Payoff of Application Analytics

Outdated or absent analytics won’t cut it in today’s data-driven applications – not for your end users, your development team, or your business. That’s what drove the five companies in this e-book to change their approach to analytics. Download this e-book to learn about the unique problems each company faced and how they achieved huge returns beyond expectation by embedding analytics into applications.

article thumbnail

Why SQL on Raw Data?

Rockset

Over a decade after the inception of the Hadoop project, the amount of unstructured data available to modern applications continues to increase. Moreover, despite forecasts to the contrary, SQL remains the lingua franca of data processing; today's NoSQL and Big Data infrastructure platform usage often involves some form of SQL-based querying. This longevity is a testament to the community of analysts and data practitioners who are familiar with SQL as well as the mature ecosystem of tools around

article thumbnail

Making smart cities safer with data

Cloudera

By Mark Micallef, Vice President of Asia Pacific and Japan , Cloudera. What comes to your mind when you think of the term “smart city”? For me, it conjures an image of a city where everything is interconnected, enabling it to run efficiently and offer convenient, secure, and personalized services to its residents at the touch of their fingertips. While such a city might sound like a utopian dream, it could potentially turn into a dystopian nightmare if we overlook the risks brought about by the

Banking 56