Open Source

Open Sourcing AvroTensorDataset: A Performant TensorFlow Dataset For Processing Avro Data

Co-authors: Jonathan Hung, Pei-Lun Liao, Lijuan Zhang, Abin Shahab, Keqiu Hu

TensorFlow is one of the most popular frameworks we use to train machine learning (ML) models at LinkedIn. It allows us to develop various ML models across our platform that power relevance and matching in the news feed, advertisements, recruiting solutions, and more. To ensure the best member experience, we want our models to be accurate and up-to-date, which requires training the models as fast as possible. However, we found that many of our workloads were bottlenecked by reading multiple terabytes of input data. 

To remove this bottleneck, we built AvroTensorDataset, a TensorFlow dataset for reading, parsing, and processing Avro data. AvroTensorDataset speeds up data preprocessing by multiple orders of magnitude, enabling us to keep site content as fresh as possible for our members. Today, we’re excited to open source this tool so that other Avro and TensorFlow users can use this dataset in their machine learning pipelines to get a large performance boost to their training workloads.

In this blog post, we will discuss the AvroTensorDataset API, techniques we used to improve data processing speeds by up to 162x over existing solutions (thereby decreasing overall training time by up to 66%), and performance results from benchmarks and production.

Avro at LinkedIn

In general, a machine learning training pipeline requires the following steps:

  1. Input data pre-processing

  2. Ingesting input data from disk to memory

  3. Training machine learning model

  4. Model validation and post-processing

Today at LinkedIn, Avro is the primary supported storage format for machine learning training data (LinkedIn uses Apache Hadoop for much of our data processing, and Avro is a widely used serialization format in Hadoop). Users provide a schema describing their data format, and Avro provides multi-language support for reading and writing Avro data from/to disk.

Avro schemas support a wide variety of types: primitive types (int, long, float, boolean, etc.) and complex types (record, enum, array, map, union, fixed). Avro serializes or deserializes data based on the data types provided in the schema. For example, ints and longs use variable-length zig-zag encoding, and arrays are encoded as a count (the number of elements in the array) followed by the encoded array elements, terminated by a zero count.
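
To make the encoding concrete, here is a minimal sketch of how Avro writes a long: the zig-zag mapping followed by base-128 varint packing. This is an illustration of the wire format, not code from the library:

Python
def encode_long(n: int) -> bytes:
    """Encode a signed long the way Avro does: zig-zag, then varint."""
    # Zig-zag maps small-magnitude ints to small unsigned ints:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

assert encode_long(1) == b"\x02"
assert encode_long(-1) == b"\x01"
assert encode_long(64) == b"\x80\x01"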

An Avro file is formatted with the following bytes:


Figure 1: Avro file and data block byte layout

The Avro file consists of four “magic” bytes, file metadata (including a schema, to which all objects in the file must conform), a 16-byte file-specific sync marker, and a sequence of data blocks separated by the file’s sync marker.

Each data block contains the number of objects in that block, the size in bytes of the objects in that block, and a sequence of serialized objects.
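
As a small illustration of this layout, the following snippet checks the four magic bytes at the start of a file (the ASCII letters "Obj" followed by the format version byte 1):

Python
def is_avro_file(path: str) -> bool:
    """Check a file's leading 4-byte Avro magic."""
    with open(path, "rb") as f:
        return f.read(4) == b"Obj\x01"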

Existing AvroRecordDataset

TensorFlow I/O contains an existing AvroRecordDataset which reads and parses Avro files into Tensors. The AvroRecordDataset itself is a tf.data.Dataset implementation whose associated AvroRecordDataset operation reads bytes from Avro files into memory.

AvroRecordDataset supports prefetching, parsing, shuffling, and batching via an auxiliary function make_avro_record_dataset, which (as sketched in code after this list):

  1. Creates an AvroRecordDataset dataset

  2. Shuffles via the underlying tf.data.Dataset ShuffleDataset operation

  3. Batches via the underlying tf.data.Dataset BatchDataset operation

  4. Parses by applying the ParseAvro operation through the tf.data.Dataset map operation

  5. Prefetches via the underlying tf.data.Dataset PrefetchDataset operation
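
The composition looks roughly like the following sketch. Here reader_fn and parse_fn are placeholders for tensorflow_io's AvroRecordDataset and ParseAvro ops (exact arguments vary by version); the point is the shape of the pipeline:

Python
import tensorflow as tf

def make_dataset(reader_fn, parse_fn, filenames, batch_size,
                 shuffle_buffer=8192):
    dataset = reader_fn(filenames)               # 1. read serialized records
    dataset = dataset.shuffle(shuffle_buffer)    # 2. ShuffleDataset
    dataset = dataset.batch(batch_size)          # 3. BatchDataset
    dataset = dataset.map(parse_fn)              # 4. ParseAvro via map
    return dataset.prefetch(tf.data.AUTOTUNE)    # 5. PrefetchDataset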

The ParseAvro operation can parse Avro data with arbitrary schemas (primitive types and/or nested complex types such as maps, unions, and arrays). It defers parsing to Avro’s GenericReader; this implementation recursively decodes the incoming bytes based on the potentially arbitrarily nested schema (e.g., an array within a map within a union). For complex types like arrays, it dynamically resizes the in-memory data structure that stores the parsed elements as it sequentially parses additional elements.

AvroTensorDataset API

Python API

The AvroTensorDataset (exposed as ATDSDataset) supports the same features as AvroRecordDataset. Here is an example of how to instantiate it:

Python
from tensorflow_io.core.python.experimental.atds.dataset import ATDSDataset
from tensorflow_io.core.python.experimental.atds.features import \
    DenseFeature, SparseFeature, VarlenFeature

dataset = ATDSDataset(
    filenames=["part-00000.avro", "part-00001.avro"],
    batch_size=1024,
    features={
        "dense_feature": DenseFeature(shape=[128], dtype=tf.float32),
        "sparse_feature": SparseFeature(shape=[50001], dtype=tf.float32),
        # -1 means unknown dimension.
        "varlen_feature": VarlenFeature(shape=[-1, -1], dtype=tf.int64)
    }
)

The constructor supports the following arguments:

Argument            | Type                                                         | Comment
filenames           | tf.string or tf.data.Dataset                                 | A tf.string tensor containing one or more filenames.
batch_size          | tf.int64                                                     | A tf.int64 scalar representing the number of records to read and parse per iteration.
features            | Dict[str, Union[DenseFeature, SparseFeature, VarlenFeature]] | A feature configuration dict with feature name as key and feature spec as value. We support DenseFeature, SparseFeature, and VarlenFeature specs. All of them are named tuples with shape and dtype information.
drop_remainder      | tf.bool                                                      | (Optional) A tf.bool scalar tf.Tensor representing whether the last batch should be dropped if it has fewer than batch_size elements. The default behavior is not to drop the smaller batch.
reader_buffer_size  | tf.int64                                                     | (Optional) A tf.int64 scalar representing the number of bytes used in file content buffering. Default is 128 * 1024 (128KB).
shuffle_buffer_size | tf.int64                                                     | (Optional) A tf.int64 scalar representing the number of records to shuffle together before batching. Default is zero, which disables shuffling.
num_parallel_calls  | tf.int64                                                     | (Optional) A tf.int64 scalar representing the maximum number of threads used by the dataset. If greater than one, records in files are processed in parallel; the value is capped at the maximum available parallelism on the host. If tf.data.AUTOTUNE is used, the number of parallel calls is set dynamically based on available CPU and workload. Default is 1.

At a minimum, the constructor requires the list of files to read, the batch size (to support batching), and a dict containing feature specs. Prefetching is enabled by default, and its behavior can be tuned via reader_buffer_size. Parsing happens automatically within the ATDSDataset operation. Shuffling is enabled by configuring shuffle_buffer_size.

Supported Avro Schemas

Although Avro supports many complex types (unions, maps, etc.), AvroTensorDataset only supports records of primitives and nested arrays. These supported types cover most TensorFlow use cases, and we get a big performance boost by only supporting a subset of complex types (more on that later).

AvroTensorDataset supports dense features, sparse features, and variable-length features, built from the TensorFlow primitive types that Avro can represent. They are represented in Avro via the following:

Primitive Types

All Avro primitive types are supported, and map to the following TensorFlow dtypes:

Avro data type | tf.dtype
int            | tf.int32
long           | tf.int64
float          | tf.float32
double         | tf.float64
boolean        | tf.bool
string         | tf.string
bytes          | tf.string

Dense Features

Dense features are represented as nested arrays in Avro. For example, a doubly nested array represents a dense feature with rank 2. Some examples of Avro schemas representing dense features:

"fields": [
  { 
    "name" : "scalar_double_feature", 
    "type" : "double"
  },
  {
    "name" : "1d_double_feature",
    "type" : { "type": "array", "items" : "double" }
  },
  {
    "name" : "2d_float_feature",
    "type" : { "type": "array", "items" : { "type": "array", "items": "float" } }
  }
]

Dense features are parsed into dense tensors. For the above, the features argument to ATDSDataset might be:

Python
{
    "scalar_double_feature": DenseFeature(shape=[], dtype=tf.float64),
    "1d_double_feature": DenseFeature(shape=[128], dtype=tf.float64),
    "2d_float_feature": DenseFeature(shape=[16, 100], dtype=tf.float32),
}

Sparse Features

Sparse features are represented as a flat list of arrays in Avro. For a sparse feature with rank N, the Avro schema contains N+1 arrays: arrays named “indices0”, “indices1”, …, “indices(N-1)” and an array named “values”. All N+1 arrays should have the same length. For example, this is the schema for a sparse feature with dtype float and rank 2:

"fields": [
  {
    "name" : "2d_float_sparse_feature",
    "type" : {
      "type" : "record",
      "name" : "2d_float_sparse_feature",
      "fields" : [ {
          "name": "indices0",
          "type": { "type": "array", "items": "long" }
        }, {
          "name": "indices1",
          "type": { "type": "array", "items": "long" }
        }, {
          "name": "values",
          "type": { "type": "array", "items": "float" }
        }
      ]
    }
  }
]

Sparse features are parsed into sparse tensors. For the above, the features argument to ATDSDataset might be:

Python
{
    "2d_float_sparse_feature": SparseFeature(shape=[16, 10], dtype=tf.float32),
}

The i-th indices array holds the indices for dimension i; i.e., the Avro representation of a sparse tensor is in coordinate (COO) format. For example, the sparse tensor tf.sparse.SparseTensor(indices=[[0,1], [2,4], [6,5]], values=[1.0, 2.0, 3.0], dense_shape=[8, 10]) would be represented in Avro via the following:

{
  "indices0" : [0, 2, 6],
  "indices1" : [1, 4, 5],
  "values" : [1.0, 2.0, 3.0]
}
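
For illustration, here is a small helper (not part of the library; ATDSDataset does this parsing in C++) that builds the corresponding SparseTensor from such a record:

Python
import tensorflow as tf

def to_sparse_tensor(record, dense_shape):
    # record holds "indices0" ... "indices(N-1)" and "values" arrays,
    # matching the Avro layout above; zip them into COO index tuples.
    rank = len(dense_shape)
    indices = list(zip(*(record[f"indices{i}"] for i in range(rank))))
    return tf.sparse.SparseTensor(
        indices=indices, values=record["values"], dense_shape=dense_shape)

st = to_sparse_tensor(
    {"indices0": [0, 2, 6], "indices1": [1, 4, 5], "values": [1.0, 2.0, 3.0]},
    dense_shape=[8, 10])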

VarLen Features

VarLen features are similar to dense features in that they are also represented as nested arrays in Avro, but they can have dimensions of unknown length (indicated by -1). Some examples of Avro schemas representing variable-length features:

"fields": [
  {
    "name" : "1d_bool_varlen_feature",
    "type" : { "type": "array", "items" : "boolean" }
  },
  {
    "name" : "2d_long_varlen_feature",
    "type" : { "type": "array", "items" : { "type": "array", "items": "long" } }
  }
]

Dimensions with length -1 can be variable length, hence variable-length features are parsed into sparse tensors. For the above, the features argument to ATDSDataset might be:

Python
{
    "1d_bool_varlen_feature": VarlenFeature(shape=[-1], dtype=tf.bool),
    "2d_long_varlen_feature": VarlenFeature(shape=[2, -1], dtype=tf.int64),
}

Here, 2d_long_varlen_feature has variable length in the last dimension; for example, an object with values [[1, 2, 3], [4, 5]] would be parsed as tf.sparse.SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]], values=[1, 2, 3, 4, 5], dense_shape=[2, 3]).

Performance Optimizations

AvroTensorDataset implements a few features to optimize performance.

Operation Fusion

AvroTensorDataset fuses several TensorFlow dataset operations (read, prefetch, parse, shuffle, and batch) into a single ATDSDataset op.

  • The read step reads raw Avro bytes from a local or remote filesystem into a memory buffer.

  • The prefetch step provides a readahead capability in a separate producer thread. The rest of the steps act as consumers of prefetched bytes.

  • The parse step converts the in-memory bytes to TensorFlow Tensors. The bytes are decoded based on the provided features metadata (i.e. a column whose metadata is a DenseFeature with shape [10, 20] and dtype tf.float32 will be parsed to a 2-D tensor with dtype tf.float32).

  • The shuffle step shuffles the objects read into memory by the prefetch step. The prefetch step will read batch_size + shuffle_buffer_size objects into memory, and the shuffle step randomly chooses batch_size objects to parse and return.

  • The batch step merges the features of multiple records into single batched tensors, reducing the memory footprint and optimizing training performance.

Implementing these steps as separate operations introduces overhead that impacts data ingestion performance. While they can be individually multithreaded via parallel loops over the entire data ingestion pipeline, fusing them into a single operation allows for better multithreading, pipelining, and tuning.
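
Concretely, the five chained tf.data stages sketched earlier collapse into one dataset node (using the API from the example above):

Python
import tensorflow as tf
from tensorflow_io.core.python.experimental.atds.dataset import ATDSDataset
from tensorflow_io.core.python.experimental.atds.features import DenseFeature

# One fused node replaces read -> shuffle -> batch -> map(parse) -> prefetch,
# so the steps share threads and buffers instead of handing tensors
# between separate tf.data iterators.
dataset = ATDSDataset(
    filenames=["part-00000.avro"],
    batch_size=1024,
    shuffle_buffer_size=8192,
    features={"dense_feature": DenseFeature(shape=[128], dtype=tf.float32)},
)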

Schema Performance

As mentioned earlier, AvroTensorDataset only supports reading Avro primitives and array types. With these types, we can already support dense, sparse, and ragged tensors, which cover the majority of use cases.

Previously, when other complex types such as unions and maps were supported, Avro schemas could get arbitrarily complicated (e.g., a record containing a map whose values are arrays of unions of int and long). Furthermore, since an Avro block stores objects in sequence, they are decoded in sequence, and each object must be deserialized according to the (arbitrarily complicated) schema. For a schema with many nested unions/maps/arrays, this recursive type checking introduces a lot of overhead. We avoid this by supporting only arrays and records as complex types.

Decoding arrays also introduces overhead. An array is serialized with the following bytes:


Figure 2: Avro array byte layout

It contains a sequence of blocks, where each block contains a count and a sequence of serialized array elements. The blocks must therefore be decoded in sequence, and we don’t know the length of the array until all of its blocks are decoded. This forces the decoder to repeatedly resize the in-memory data structure storing the decoded array as more blocks are decoded. We avoid this by passing the array shapes to the ATDSDataset constructor, which lets us pre-allocate the in-memory array without having to resize it.
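
The difference can be sketched in a few lines of Python (the real parser is C++; this just contrasts the two allocation strategies, treating each block as a list of floats):

Python
import numpy as np

def parse_unknown_shape(blocks):
    # Length unknown until the last block: grow a buffer block by block,
    # then copy into a contiguous array at the end.
    values = []
    for block in blocks:
        values.extend(block)
    return np.asarray(values, dtype=np.float32)

def parse_known_shape(blocks, length):
    # Length known up front (from the feature spec): allocate once,
    # then fill in place with no resizing or final copy.
    out = np.empty(length, dtype=np.float32)
    pos = 0
    for block in blocks:
        out[pos:pos + len(block)] = block
        pos += len(block)
    return out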

Shuffle Algorithm

Another challenge with Avro is that blocks do not track the offsets of the objects within them, which makes it impossible to jump to a random offset and decode an object. In other words, Avro blocks can only be read sequentially. This limitation adds complexity to shuffling: to shuffle records within an Avro block, we would have to read all records sequentially and shuffle the intermediate results, introducing an extra copy that hurts performance.

In the ATDSDataset, we use a shuffle algorithm that samples the number of records to read from each in-memory Avro block and merges the objects read from multiple blocks into the batched output. This way, we still read Avro objects sequentially, without an extra copy. Blocks are kept in memory until they are fully read. For example, assume three Avro blocks are loaded into memory and each block stores ten Avro objects. ATDSDataset can read one object from block 1, two objects from block 2, and one object from block 3 to create output tensors with batch size four. The number of objects to read from each block is randomly sampled. Although the algorithm does not produce a perfect shuffle, we have not seen model performance degradation in our production models.
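
A toy version of the sampling idea, operating on already-decoded Python lists rather than encoded Avro blocks:

Python
import random

def sample_batch(blocks, batch_size):
    # blocks: list of lists standing in for in-memory Avro blocks. Each
    # block is only consumed front-to-back (sequential reads); randomness
    # comes from choosing which block supplies the next object.
    remaining = [list(block) for block in blocks]
    batch = []
    while len(batch) < batch_size and any(remaining):
        i = random.choices(range(len(remaining)),
                           weights=[len(b) for b in remaining])[0]
        batch.append(remaining[i].pop(0))
    return batch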

Thread Parallelism

The ATDSDataset constructor takes a num_parallel_calls argument, which determines how many threads to use for parsing. ATDSDataset determines which blocks the next batch of objects will come from (either the earliest-read blocks if shuffle is disabled, or the blocks containing the randomly sampled batch_size objects if shuffle is enabled). These blocks are split across the configured number of threads and parsed in parallel.

Block Distribution

The logic for distributing blocks across threads can also impact performance. Ideally, all threads complete parsing at the same time; otherwise, multi-threading doesn’t achieve maximum speedup. To achieve this, we apply a cost-based model that estimates the time needed to process each block, then distribute blocks so that cost is balanced across threads. A block’s cost depends on whether it is compressed and on how many undecoded objects it still contains.

Here is an example with eight in-memory blocks and four threads. Blocks could be distributed as follows:


Figure 3: Eight blocks distributed across four threads

Note that since the blocks given to threads 0 and 1 are uncompressed, these threads are given more blocks to decode compared to threads 2 and 3, and threads 0 and 1 are given (roughly) equal numbers of objects to decode.
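
One plausible implementation of this balancing is a greedy assignment: hand each block, costliest first, to the least-loaded thread so far. This is a sketch of the idea; the open-source implementation may differ:

Python
import heapq

def distribute_blocks(block_costs, num_threads):
    # block_costs: list of (block_id, estimated_cost) pairs, where the
    # cost estimate already accounts for compression and the number of
    # remaining undecoded objects.
    heap = [(0.0, t, []) for t in range(num_threads)]  # (load, thread, blocks)
    heapq.heapify(heap)
    for block_id, cost in sorted(block_costs, key=lambda b: -b[1]):
        load, t, assigned = heapq.heappop(heap)  # least-loaded thread
        assigned.append(block_id)
        heapq.heappush(heap, (load + cost, t, assigned))
    return {t: assigned for _, t, assigned in heap}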

Thread Count Auto-Tuning

Although increasing the thread count can help performance, it eventually reaches a point of diminishing returns; increasing it too far can actually hurt performance due to thread latency overhead. Furthermore, it would be wasteful to spawn six threads if there are only five blocks in memory.

num_parallel_calls supports the tf.data.AUTOTUNE value, which lets ATDSDataset determine the appropriate number of threads while processing each batch. To do this, it chooses the thread count that minimizes the estimated cost, where the estimated cost for a given thread count is:

estimated_cost = (Σ block_cost) / thread_count + thread_latency_overhead

We compute the total cost of decompressing and decoding the current batch, divide that cost evenly across the threads, and add the thread latency overhead for that thread count. Note that increasing the thread count reduces the per-thread cost but increases the thread latency overhead.
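
As a sketch, choosing the thread count then amounts to minimizing this expression over the candidates (the block costs and the overhead function below are made-up placeholders):

Python
def pick_thread_count(block_costs, max_threads, latency_overhead):
    # Never consider more threads than there are blocks to decode, and
    # minimize: sum(block_costs) / threads + latency_overhead(threads).
    total = sum(block_costs)
    candidates = range(1, min(max_threads, len(block_costs)) + 1)
    return min(candidates, key=lambda t: total / t + latency_overhead(t))

# Example with a hypothetical linear overhead model:
best = pick_thread_count([3.0, 2.5, 4.0, 1.0, 2.0],
                         max_threads=8,
                         latency_overhead=lambda t: 0.3 * t)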

Thread Parallelism Benchmarks

In our experiments, we found that increasing thread parallelism can help speed up throughput by distributing the parsing workload across threads. It is especially helpful for workloads with a large number of blocks to process on each iteration (e.g. workloads with a large batch size).

We ran benchmarks to measure I/O throughput on various thread counts (with deflate codec). The benchmark contains various dense and sparse features with different shapes and dtypes.


Figure 4: Throughput scaling for multi-threaded AvroTensorDataset

Generally, increasing threads can increase throughput, with better scaling as the batch size increases (since there is more workload to distribute among more threads). Furthermore, thread autotuning can achieve close to optimal performance.

Performance Results

AvroTensorDataset has been in production at LinkedIn for over a year as the default Avro reader for machine learning training, and has removed I/O as a training bottleneck. It improves on existing Avro data ingestion solutions by multiple orders of magnitude.

We ran a benchmark on an internal production schema on various batch sizes to compare I/O performance of AvroRecordDataset and ATDSDataset. The schema contained:

  • 6 scalar tensors (dense tensors with rank 0)

  • 8 dense tensors with rank 1

  • 5 sparse tensors with rank 1

This was the average time spent in I/O per step:

                  | Batch size 64 | Batch size 256 | Batch size 1024
AvroRecordDataset | 40 ms/step    | 160 ms/step    | 650 ms/step
ATDSDataset       | 1.2 ms/step   | 1.3 ms/step    | 4 ms/step
Improvement       | 33x           | 123x           | 162x

Figure 5: AvroRecordDataset vs. AvroTensorDataset latency

Furthermore, we saw a 35%-66% reduction in total training time (not just I/O time) for production flows.

Conclusion

ATDSDataset is LinkedIn’s solution to efficiently read Avro data into TensorFlow. Through multiple performance enhancements, we were able to speed up I/O throughput by orders of magnitude over existing Avro reader solutions. Our team at LinkedIn worked closely with the TensorFlow I/O community to open source this feature, and we hope that by doing so, the TensorFlow community can also benefit from these performance enhancements. For more details, please check out the ATDSDataset code on GitHub.

Acknowledgments

Thanks to an amazing team of engineers on the Deep Learning Infrastructure team: Pei-Lun Liao, Jonathan Hung, Abin Shahab, Arup De, Lijuan Zhang, and Cheng Ren for working on this project, and special thanks to Pei-Lun Liao for starting the project and providing technical guidance throughout. Thanks to the management team for supporting this project: Keqiu Hu, Joshua Hartman, Animesh Singh, Tanton Gibbs, and Kapil Surlaker. Many thanks to Vignesh Kothapalli from the TensorFlow open-source community for reviewing the PR. Last but not least, many thanks to the reviewers of this blog post: Ben Levine, Animesh Singh, Qingquan Song, and Keqiu Hu, and to the LinkedIn Editorial team, Katherine Vaiente and Greg Earl, for their reviews and suggestions.