How to Store Historical Data Much More Efficiently

A hands-on tutorial using PySpark to store as little as 0.01% of a DataFrame’s rows without losing any information.

Tomer Gabay
Towards Data Science
10 min read · Sep 10, 2023


Photo by Supratik Deshmukh on Unsplash

In an era where companies and organizations are collecting more data than ever before, datasets tend to accumulate millions of unnecessary rows that don’t contain any new or valuable…


Data Scientist / Machine Learning Engineer / Python Developer from the Netherlands. Writing articles and publishing open source code on a regular basis.