RAPIDS cuDF for Accelerated Data Science on Google Colab

GPU-accelerated dataframe library that implements the familiar pandas API for processing and analyzing your data.




 

NVIDIA GPUs have become one of the most effective ways to accelerate computationally intensive machine learning tasks. Now, thanks to RAPIDS cuDF, GPUs can also turbocharge your data analysis work.

 

What is RAPIDS cuDF?

 

RAPIDS cuDF is an open-source, GPU-accelerated dataframe library that implements the familiar pandas API for processing and analyzing your data. The Python cuDF interface is built on libcudf, the CUDA/C++ computational core that accelerates fundamental data operations, from ingestion and parsing to joins, aggregations, and more. For some workloads, switching from import pandas to import cudf can speed up data processing by 10x or more.

For example, a simple join operation can go from 761 ms to 27 ms simply by switching to cuDF.
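To make the switch concrete, here is a minimal sketch of a join written so that the same code runs with either library. The table contents and column names are made up for illustration, and the pandas fallback is an assumption added so the snippet also runs without a GPU. It does not reproduce the timing quoted above.

```python
# Use cuDF when it is installed (GPU runtime); otherwise fall back to pandas.
# The fallback is an assumption for illustration, not part of the original post.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# Two small, made-up tables that share a "key" column.
left = xdf.DataFrame({"key": [1, 2, 3, 4], "left_val": [10, 20, 30, 40]})
right = xdf.DataFrame({"key": [2, 3, 4, 5], "right_val": [200, 300, 400, 500]})

# Inner join on the shared column; the call is identical in both libraries.
joined = left.merge(right, on="key", how="inner")

print(joined.sort_values("key"))
```

On tables this small the GPU will not show a speedup. The point is only that the dataframe code itself stays the same.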

 

Getting started with RAPIDS on Colab

 

Getting started with RAPIDS on Colab is now straightforward. With Colab’s default runtime updated to Python 3.8 and the new RAPIDS pip packages, you can try out NVIDIA GPU-accelerated data science right in your browser. Running RAPIDS on Colab requires just two quick steps:

  1. First, select a Colab runtime that uses a GPU accelerator. Navigate to the “Runtime” menu and select “Change runtime type,” then choose “GPU” from the dropdown and click “Save.” The NVIDIA GPU that you receive from Colab may vary across sessions and can be from a newer or an older generation. With the new “Pay As You Go” tier in Colab, you also have the option to upgrade your runtime to “Premium GPUs” with Colab Pro, enabling access to more powerful NVIDIA A100 or V100 Tensor Core GPUs. See Google’s blog post for more information on GPU availability.
  2. Second, install RAPIDS cuDF in your notebook. With the new RAPIDS pip packages, this step is short. Execute the following commands in a code block and you will be set up to run RAPIDS. Make sure to restart your runtime after the installation completes:

 

!pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com

!rm -rf /usr/local/lib/python3.8/dist-packages/cupy*

!pip install cupy-cuda11x
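Note that the cleanup command above hard-codes the Python 3.8 path. As a rough sketch, assuming Colab keeps the same /usr/local/lib/pythonX.Y/dist-packages layout on other Python versions (this post does not confirm that), the matching pattern could be derived from the running interpreter. The helper name cupy_cleanup_pattern is hypothetical.

```python
import sys


def cupy_cleanup_pattern(version_info=sys.version_info):
    """Build the glob targeted by the rm command for a given Python version.

    Assumes the /usr/local/lib/pythonX.Y/dist-packages layout used in the
    original command; other environments may store packages elsewhere.
    """
    major, minor = version_info[0], version_info[1]
    return f"/usr/local/lib/python{major}.{minor}/dist-packages/cupy*"


# Show the pattern for the interpreter running this cell.
print(cupy_cleanup_pattern())
```

If the printed path differs from the one in the command, the hard-coded path would not match anything in that runtime.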

 

Finally, check that import cudf completes successfully in a new code block, and then you are ready to go. If you run into any trouble, please reach out in the RAPIDS Slack and we’ll help you get things working correctly.
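One way to run that check, sketched here under the assumption that the nvidia-smi tool is on the PATH whenever a GPU runtime is attached, is to test for the GPU and the import separately. The function names gpu_visible and cudf_imports_cleanly are hypothetical helpers, not part of RAPIDS.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if nvidia-smi is found and lists GPUs without error."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def cudf_imports_cleanly():
    """Return True if `import cudf` succeeds in this runtime."""
    try:
        import cudf
    except Exception as exc:  # a missing package or a CUDA setup problem
        print("import cudf failed:", exc)
        return False
    print("cuDF version:", cudf.__version__)
    return True


print("GPU visible:", gpu_visible())
print("cuDF ready:", cudf_imports_cleanly())
```

If the first line prints False, revisit the runtime type. If only the second does, rerun the installation and restart the runtime.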

 

Running 10 minutes to cuDF on Colab

 

Now that you have a working cuDF installation and a GPU, you can run our tutorial notebook, “10 minutes to cuDF.” This notebook is inspired by a similar guide from the pandas community and is a streamlined version of our full notebook, “10 Minutes to cuDF and Dask-cuDF.”

Running through the notebook, you will find examples of dataframe creation, data filtering, transformation, joins, aggregations and more. We’ve also included file reading and writing examples for Parquet, ORC and CSV formats. As you investigate more complex data processing, we hope that you use this as a companion to cuDF’s documentation.
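For a flavor of those operations, here is a short sketch that is not taken from the notebook itself. The data is made up, only CSV is shown for file I/O, and the pandas fallback is an assumption so the code also runs where cuDF is not installed.

```python
import os
import tempfile

# Assumption: fall back to pandas when cuDF is unavailable. The calls used
# below exist in both libraries.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# Dataframe creation from a dictionary of columns (made-up data).
df = xdf.DataFrame(
    {
        "city": ["Denver", "Austin", "Denver", "Austin", "Boston"],
        "sales": [10, 20, 30, 40, 50],
    }
)

# Filtering with a boolean mask.
large = df[df["sales"] >= 30]

# Transformation: add a derived column.
df["sales_doubled"] = df["sales"] * 2

# Aggregation: total sales per city.
totals = df.groupby("city")["sales"].sum()

# File writing and reading, using CSV in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sales.csv")
    df.to_csv(path, index=False)
    reloaded = xdf.read_csv(path)

print(large)
print(totals)
print(reloaded)
```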

 

Exploring the rest of RAPIDS

 

When you are ready to dive deeper, RAPIDS also includes Dask-cuDF for large workflows, cuML for scikit-learn-compatible, accelerated machine learning, and cuGraph for graph data analytics. Update your Colab notebook with the extended installation list, as shown in the following code block, and you’ll be ready to use the complete toolkit.

!pip install cudf-cu11 dask-cudf-cu11 cuml-cu11 cugraph-cu11 --extra-index-url=https://pypi.ngc.nvidia.com

!rm -rf /usr/local/lib/python3.8/dist-packages/cupy*

!pip install cupy-cuda11x
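After restarting the runtime, a quick way to see which of these libraries can be located is sketched below. The mapping from pip package names to import names reflects the usual module names (cudf, dask_cudf, cuml, cugraph), and the helper installed_rapids_packages is hypothetical. It only checks that each module can be found, not that it imports without error.

```python
import importlib.util

# pip package name (from the install command) -> Python import name.
RAPIDS_PACKAGES = {
    "cudf-cu11": "cudf",
    "dask-cudf-cu11": "dask_cudf",
    "cuml-cu11": "cuml",
    "cugraph-cu11": "cugraph",
}


def installed_rapids_packages(packages=RAPIDS_PACKAGES):
    """Report, per pip package, whether its module can be located."""
    return {
        pip_name: importlib.util.find_spec(module_name) is not None
        for pip_name, module_name in packages.items()
    }


for pip_name, found in installed_rapids_packages().items():
    print(f"{pip_name}: {'found' if found else 'missing'}")
```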

 

Here are some additional RAPIDS notebooks you can explore to learn more about RAPIDS:

 
 
Paul Mahler is a senior data scientist and technical product manager for machine learning at NVIDIA in Denver, CO. At NVIDIA, Paul’s focus has been on building tools that accelerate data science workflows by leveraging the power of GPU technology.

 
Original. Reposted with permission.