Top Data Engineering Digest Aggregated Data Analytics Application Content for November, 2018

November, 2018

Open-Source Data Warehousing – Druid, Apache Airflow & Superset

Simon Späti

NOVEMBER 28, 2018

These days, everyone talks about open source. However, open source is still uncommon in the Data Warehouse (DWH) field. Why is this? In my recent blog post, I researched OLAP technologies; for this post, I chose some open-source technologies and used them together to build a full data architecture for a Data Warehouse system. I went with Apache Druid for data storage, Apache Superset for querying and Apache Airflow as a task orchestrator.

Data Warehouse

Data Warehouse Data Storage Data Architecture Architecture

Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57

Data Engineering Podcast

NOVEMBER 18, 2018

Summary: Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the landscape of stream processing tools, and how you can start us

Process

Process Scala Google Cloud Kafka

Webinars

The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Communication

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

MORE WEBINARS

Trending Sources

Observability at Scale: Building Uber’s Alerting Ecosystem

Uber Engineering

NOVEMBER 20, 2018

Uber’s software architecture consists of thousands of microservices that empower teams to iterate quickly and support our company’s global growth. These microservices support a variety of solutions, such as mobile applications, internal and infrastructure services, and products along with complex … The post Observability at Scale: Building Uber’s Alerting Ecosystem appeared first on Uber Engineering Blog.

Building

Building Architecture Engineering

Netflix Information Security: Preventing Credential Compromise in AWS

Netflix Tech

NOVEMBER 28, 2018

By Will Bengtson. Previously, we wrote about a method for detecting credential compromise in your AWS environment. The methodology focused on a continuous learning model and first use principle. This solution is still reactive in nature: we only detect credential compromise after it has already happened. Even with detection capabilities, there is a risk that exposed credentials can provide access to sensitive data and/or the ability to cause damage in our environment.

AWS

AWS Metadata Amazon Web Services Cloud

Get Better Network Graphs & Save Analysts Time

Many organizations today are unlocking the power of their data by using graph databases to feed downstream analytics, enhance visualizations, and more. Yet, when different graph nodes represent the same entity, graphs get messy. Watch this essential video with Senzing CEO Jeff Jonas on how adding entity resolution to a graph database condenses network graphs to improve analytics and save your analysts time.

Collaboration Between Data Science and Data Engineering: True or False?

Domino Data Lab: Data Engineering

NOVEMBER 18, 2018

This blog post includes candid insights about addressing tension points that arise when people collaborate on developing and deploying models. Domino’s Head of Content sat down with Don Miner and Marshall Presser to discuss the state of collaboration between data science and data engineering. The blog post provides distilled insights, audio clips, excerpted quotes as well as the full audio and written transcript.

Data Science

Data Science Data Engineering Data Engineer Engineering

Five strategies for skills-based volunteering: Lessons learned from Cloudera Cares first-ever Global Day of Service

Cloudera

NOVEMBER 5, 2018

Corporate volunteering is on the rise. However, only half of companies encourage their employees to participate in skills-based volunteering – defined as employees applying their abilities and specialized talents to challenges facing their communities. As the Program Manager for Cloudera Cares, Cloudera’s employee giving and volunteering program at the Cloudera Foundation, I believe that we can have more impact if we offer employees opportunities for skills-based volunteering.

Food

Food Banking Finance Programming

OLAP, what’s coming next?

Simon Späti

NOVEMBER 23, 2018

Are you on the lookout for a replacement for Microsoft Analysis Cubes? Are you looking for a big data OLAP system that scales ad libitum? Do you want your analytics updated even in real time? In this blog post, I want to show you possible solutions that are ready for the future and fit into an existing data architecture. What is OLAP? OLAP is an acronym for Online Analytical Processing.

Big Data

Big Data Data Architecture Architecture Systems

More Trending

How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56

Data Engineering Podcast

NOVEMBER 11, 2018

Summary: A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.

Data Lake

Data Lake Building Cloud Kafka

Zalando Research Releases “Flair”

Zalando Engineering

NOVEMBER 21, 2018

Open sourcing machine learning research for natural language processing (NLP). Two years ago, Zalando Research launched with a clear purpose: to ensure that Zalando Tech is at the forefront of research in the areas of data science, machine learning, natural language processing and artificial intelligence. Our researchers’ work previously focused mainly within Zalando.

Deep Learning

Deep Learning Machine Learning Datasets Data Science

Rockset's RocksDB-Cloud Library - Enabling the Next Generation of Cloud Native Databases

Rockset

NOVEMBER 7, 2018

Rockset and I began collaborating in 2016 due to my interest in their RocksDB-Cloud open-source key-value store. This post is primarily about the RocksDB-Cloud software, which Rockset open-sourced in 2016, rather than Rockset's newly launched cloud service. In it, I will explore how RocksDB-Cloud can be used to build an open-source cloud-friendly storage system.

Database

Database Cloud Cloud Storage MySQL

Delivering Meaning with Previews on Web

Netflix Tech

NOVEMBER 12, 2018

By Corey Grunewald and Tony Casparro. As the Netflix catalog of films and series continues to grow, it becomes more challenging to present members with enough information to decide what to watch. How can a member tell if a movie is both a horror and a comedy? The synopsis and artwork help provide some context, but how can we leverage video previews (trailers) to help members find something great to watch?

Utilities

Utilities Coding Management Systems

Understanding User Needs and Satisfying Them

Speaker: Scott Sehlhorst

We know we want to create products which our customers find to be valuable. Whether we label it as customer-centric or product-led depends on how long we've been doing product management. There are three challenges we face when doing this. The obvious challenge is figuring out what our users need; the non-obvious challenges are in creating a shared understanding of those needs and in sensing if what we're doing is meeting those needs.

Cloudera Named a Fastest Growing Company by Deloitte for Fourth Year

Cloudera

NOVEMBER 20, 2018

For the fourth time in the past five years, Cloudera has been named to Deloitte’s Technology Fast 500 as one of the fastest growing companies in North America. This annual ranking showcases the growth of companies in the technology, media, telecommunications, life sciences, and energy tech sectors. This year’s list demonstrated the power of combining breakthrough research and development, entrepreneurship and rapid growth, with software companies like Cloudera making up nearly two-thirds of the

Telecommunication

Telecommunication Media Cloud Technology

Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55

Data Engineering Podcast

NOVEMBER 4, 2018

Summary: Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page.

Business Intelligence

Business Intelligence Hadoop BI Data Warehouse

Digital Transformation Focused on Sustainability

Cloudera

NOVEMBER 19, 2018

My inspiration for writing this blog was a recent trip to a warehouse and distribution center of a well-known U.S. fast-food enterprise with a reputation for superior quality. During my visit, I had the opportunity to chat with the center’s Manager for Food Safety whose credentials (Ph.D. in Food Science), knowledge, and experience reflect the company’s commitment to product safety and quality.

Food

Food Big Data Machine Learning Data Integration

Train Deep Learning Models on AWS

Zalando Engineering

NOVEMBER 7, 2018

A real-life example of how to train a Deep Learning model on an AWS Spot Instance using Spotty. Spotty is a tool that simplifies training of Deep Learning models on AWS. Why will you ❤️ this tool? It makes training on AWS GPU instances as simple as training on your local computer; it automatically manages all necessary AWS resources, including AMIs, volumes and snapshots; it makes your model trainable on AWS by anyone with a couple of commands; it detaches remote processes from SSH sessions; it sav

Deep Learning

Deep Learning AWS Python Project

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Speaker: Timothy Chan, PhD., Head of Data Science

Are you ready to move beyond the basics and take a deep dive into the cutting-edge techniques that are reshaping the landscape of experimentation? 🌐 From Sequential Testing to Multi-Armed Bandits, Switchback Experiments to Stratified Sampling, Timothy Chan, Data Science Lead, is here to unravel the mysteries of these powerful methodologies that are revolutionizing how we approach testing.

Data Science

Connexion 2.0 Release

Zalando Engineering

NOVEMBER 4, 2018

Today, we released Connexion 2.0 with OpenAPI 3 support. Connexion is a Python framework that automagically handles HTTP requests based on the OpenAPI Specification (formerly known as the Swagger Spec) of your API, described in YAML format. Connexion allows you to write a Swagger specification, then maps the endpoints to your Python functions. Besides routing, Connexion also validates requests and responses automatically based on OpenAPI specifications, handles common authentication schemes, supports API

Python

Python IT
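The spec-first routing described in the Connexion entry above can be illustrated with a small standard-library sketch. This is not Connexion's implementation or API: the `SPEC` dictionary, `resolve`, and `build_routes` are hypothetical names standing in for a parsed OpenAPI document and a resolver, and the only idea carried over is that an `operationId` string names the Python function that handles an endpoint.

```python
from importlib import import_module

# Hypothetical stand-in for a parsed OpenAPI document (normally written in YAML).
# The operationId points at a standard-library function so the sketch is runnable.
SPEC = {
    "paths": {
        "/greeting/{name}": {
            "get": {"operationId": "string.capwords"},
        },
    },
}


def resolve(operation_id):
    """Turn a dotted 'module.function' string into the callable it names."""
    module_name, _, function_name = operation_id.rpartition(".")
    return getattr(import_module(module_name), function_name)


def build_routes(spec):
    """Map (HTTP method, path) pairs to the functions named by operationId."""
    routes = {}
    for path, operations in spec["paths"].items():
        for method, operation in operations.items():
            routes[(method.upper(), path)] = resolve(operation["operationId"])
    return routes


routes = build_routes(SPEC)
handler = routes[("GET", "/greeting/{name}")]
print(handler("hello world"))
```

Keeping the routing table in the specification rather than in decorators means the API description stays the single source of truth, which is the design choice the entry refers to as mapping endpoints to Python functions.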

Set Up Your Own Data-as-a-Service Platform On Dremio with Tomer Shiran - Episode 58

Data Engineering Podcast

NOVEMBER 25, 2018

Summary: When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions of it. The default way to manage this situation is by crafting pipelines that will extract the data from source systems and load it into a data lake or data warehouse. In order to make this situation more manageable and allow everyone in the business to gain value from the data, the folks at Dremio built a self-service data platform.

Data Lake

Data Lake Data Warehouse Hadoop BI

Netflix at AWS re:Invent 2018

Netflix Tech

NOVEMBER 26, 2018

By Shaun Blackburn. AWS re:Invent is back in Las Vegas this week! Many Netflix engineers and leaders will be among the 40,000 attending the conference to connect with fellow cloud and OSS enthusiasts. You can find us at our booth on the expo floor, speaking on a variety of subjects, and at meetups and events around the re:Invent campus. We have listed all our talks below to make it easy to hear what we have been up to.

AWS

AWS Software Engineer Software Engineering Entertainment

Dynamic Typing in SQL

Rockset

NOVEMBER 1, 2018

As Peter Bailis put it in his post, querying unstructured data using SQL is a painful process. Moreover, developers frequently prefer dynamic programming languages, so interacting with the strict type system of SQL is a barrier. We at Rockset have built the first schemaless SQL data platform. In this post and a few others that follow, we'd like to introduce you to our approach.

SQL

SQL NoSQL Programming Language Bytes

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

Speaker: Aarushi Kansal, AI Leader & Author and Tony Karrer, Founder & CTO at Aggregage

Software leaders who are building applications based on Large Language Models (LLMs) often find it a challenge to achieve reliability. It’s no surprise given the non-deterministic nature of LLMs. To effectively create reliable LLM-based (often with RAG) applications, extensive testing and evaluation processes are crucial. This often ends up involving meticulous adjustments to prompts.

Building

Open Source: October Review - Hacktoberfest, new releases and more.

Zalando Engineering

NOVEMBER 5, 2018

Project Highlights: Connexion version 2.0 with OpenAPI 3 support is ready; check out what is new in our latest release! Connexion is the Swagger/OpenAPI-first framework for Python on top of Flask with automatic endpoint validation & OAuth2 support. With 87 active contributors and more than 1,000 repositories worldwide that depend on it, Connexion is one of Zalando's most successful open source releases.

PostgreSQL

PostgreSQL Professional Services Media Software Engineer

An introduction to Federated Learning

Cloudera

NOVEMBER 14, 2018

We’re excited to release Federated Learning, the latest report and prototype from Cloudera Fast Forward Labs. Federated learning makes it possible to build machine learning systems without direct access to training data. The data remains in its original location, which helps to ensure privacy and reduces communication costs. This article is about the business case for federated learning.

Manufacturing

Manufacturing Healthcare Machine Learning Medical
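To make the federated learning entry above concrete, here is a minimal sketch of one common scheme, federated averaging, for a one-parameter model y = w * x. It is an illustration under stated assumptions, not the prototype from the report: the client datasets, `local_update`, and `federated_average` are all hypothetical. Each client trains on its own data, and only the resulting weight is sent back to be averaged.

```python
def local_update(weight, data, learning_rate=0.1, epochs=20):
    """Run gradient descent for y = w * x on one client's private data."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to the weight.
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= learning_rate * grad
    return weight


def federated_average(global_weight, clients):
    """One round: clients train locally; the server averages their weights."""
    total = sum(len(data) for data in clients)
    updates = [(local_update(global_weight, data), len(data)) for data in clients]
    # Weight each client's result by the number of examples it holds.
    return sum(w * n for w, n in updates) / total


# Hypothetical private datasets, all drawn from y = 3x; they never leave the client.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
]

weight = 0.0
for _ in range(10):
    weight = federated_average(weight, clients)
print(round(weight, 3))
```

The server only ever sees one number per client per round, which is the sense in which the training data stays in its original location.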

Tag-based Navigation of a Fashion Catalog

Zalando Engineering

NOVEMBER 28, 2018

Exploring the Zalando Assortment by Browsing a Product Similarity Graph. Introduction: As Europe's leading online fashion and lifestyle platform, Zalando is continually developing new features to enable our customers to find the products they want. While the standard tools of Search, Categorization & Attribute Filtering are par-for-the-course for purchasing items online, with an ever-expanding fashion assortment and an increase in the data available to describe a product, this browsing experie

Algorithm

Algorithm Computer Science Python Big Data

Zalando Postgres Operator: One Year Later

Zalando Engineering

NOVEMBER 25, 2018

The Postgres operator provides a managed Postgres service for Kubernetes. It extends the Kubernetes API with a custom “postgresql” resource that describes desired characteristics of a Postgres cluster, monitors updates of this resource and adjusts Postgres clusters accordingly. Zalando successfully uses the operator to manage more than 450 Postgres clusters across a large number of Kubernetes installations.

PostgreSQL

PostgreSQL Cloud Storage Cloud Computing Database
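The "describe the desired state, then adjust the clusters" behaviour in the Postgres operator entry above follows the general reconcile pattern used by Kubernetes operators. The sketch below shows that pattern in plain Python; it is not the operator's code, and the cluster names and the `instances` and `version` fields are hypothetical placeholders for whatever a real “postgresql” resource declares.

```python
def reconcile(desired, actual):
    """Return the actions needed to move the actual clusters to the desired state."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical desired state, as declared in custom resources.
desired = {
    "orders-db": {"instances": 3, "version": "11"},
    "users-db": {"instances": 2, "version": "10"},
}

# Hypothetical state observed in the running installation.
actual = {
    "orders-db": {"instances": 2, "version": "11"},
    "legacy-db": {"instances": 1, "version": "9.6"},
}

for action in reconcile(desired, actual):
    print(action)
```

A real operator would run a comparison like this whenever a resource changes and then carry out each action against the cluster; the sketch stops at computing the plan.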

The Big Payoff of Application Analytics

Outdated or absent analytics won’t cut it in today’s data-driven applications – not for your end users, your development team, or your business. That’s what drove the five companies in this e-book to change their approach to analytics. Download this e-book to learn about the unique problems each company faced and how they achieved huge returns beyond expectation by embedding analytics into applications.

Building

Why SQL on Raw Data?

Rockset

NOVEMBER 1, 2018

Over a decade after the inception of the Hadoop project, the amount of unstructured data available to modern applications continues to increase. Moreover, despite forecasts to the contrary, SQL remains the lingua franca of data processing; today's NoSQL and Big Data infrastructure platform usage often involves some form of SQL-based querying. This longevity is a testament to the community of analysts and data practitioners who are familiar with SQL as well as the mature ecosystem of tools around

Raw Data

Raw Data SQL Unstructured Data NoSQL

Making smart cities safer with data

Cloudera

NOVEMBER 9, 2018

By Mark Micallef, Vice President of Asia Pacific and Japan, Cloudera. What comes to your mind when you think of the term “smart city”? For me, it conjures an image of a city where everything is interconnected, enabling it to run efficiently and offer convenient, secure, and personalized services to its residents at the touch of their fingertips. While such a city might sound like a utopian dream, it could potentially turn into a dystopian nightmare if we overlook the risks brought about by the

Banking

Banking Machine Learning Government Media
