100+ Machine Learning Datasets Curated For You

Best Public Machine Learning Datasets for Beginners: A topic-centric list of free datasets for machine learning and data science enthusiasts.

By ProjectPro

The best way to learn data science and machine learning is by doing diverse projects. There are plenty of real-world machine learning datasets that you can use to practice fundamental data science and machine learning skills, even before completing a comprehensive course. There is no substitute for hands-on projects. The most common mistake beginners make is focusing on theoretical concepts for too long before starting a project that puts those concepts into practice.


It is always good to have theoretical clarity on machine learning concepts, but without relevant practical exposure you cannot expect to become an enterprise data scientist or a machine learning engineer. In this blog, we provide over 100 worthwhile datasets for machine learning, particularly for beginners, that will help you validate your basic data science and machine learning skills.

What is a dataset in machine learning?

A dataset in machine learning is a collection of instances (an instance is a single row of data) that share common features and attributes. To train and evaluate a machine learning model, two kinds of datasets are required:

  1. Training Dataset - The data that is fed into the machine learning algorithm for training.

  2. Test Dataset or Validation Dataset - The data that is held out from training and used to evaluate how accurately the machine learning model predicts on unseen data.
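The split between the two kinds of datasets can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `train_test_split`, the 80/20 ratio, and the toy rows are assumptions made for the example and do not come from any specific dataset in this list.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy instances: each row is (feature_1, feature_2, label).
rows = [(i, i * 2, i % 2) for i in range(10)]
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

Fixing the seed keeps the split reproducible, which makes it easier to compare models trained on the same data.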


Why do you need machine learning datasets?

Machine learning algorithms learn from data. A machine learning algorithm identifies trends and relationships and makes predictions based on the large volumes of data used to train it. Thus, data is the golden goose in machine learning. The insights gleaned from machine learning models are only as good as the dataset. Larger and better training data generally leads to better and more accurate model performance. Reliable machine learning datasets therefore play a vital role in the development of accurate machine learning models.



Where can I find datasets for machine learning?

There are tons of free and paid resources available for machine learning datasets.

However, for data science and machine learning beginners, choosing from the plethora of options available can be overwhelming. If you’re trying to learn machine learning, you need a strong foundation, which means interesting datasets for machine learning projects and some project ideas to work on with these free datasets. Whether it’s Retail, Healthcare, Banking & Finance, Crime, or any other kind of machine learning dataset, we’ve curated a list of top machine learning datasets to help you make your models successful.



100+ Machine Learning Datasets for Data Science and Machine Learning Practitioners


We've aggregated a domain-centric list of top machine learning datasets with a short description of the data and the projects that you can work with using a specific dataset.

  • Retail Machine Learning Datasets

  • Healthcare Machine Learning Datasets

  • Banking and Finance Machine Learning Datasets

  • Social Media Machine Learning Datasets

  • Crime Machine Learning Datasets


Best Retail Datasets for Machine Learning


Retail Transactional Machine Learning Datasets

1) Online Retail Dataset (UK Online Store)

If you are keen on preprocessing large retail datasets, look up the transactional data of a UK-based online retailer that sells unique all-occasion gifts. With over 500,000 rows and 8 attributes, the dataset is most commonly used for classification and clustering tasks.

 Download Online Retail Dataset for Machine Learning

Interesting Machine Learning Project Idea using UK Online Retail Dataset - Perform Market Basket Analysis to identify the association rules between the products.
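To make the idea of association rules concrete, here is a minimal sketch that computes support and confidence for a pair of items. The baskets and product names are made up for illustration and are not taken from the Online Retail dataset; with the real data you would first group rows by invoice to build the baskets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets; real invoices would be grouped by invoice number.
baskets = [
    {"mug", "teapot", "candle"},
    {"mug", "teapot"},
    {"candle", "gift bag"},
    {"mug", "teapot", "gift bag"},
]

# Count how often each item and each unordered item pair appears.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def rule_metrics(antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    support = pair_counts[pair] / len(baskets)
    confidence = pair_counts[pair] / item_counts[antecedent]
    return support, confidence


support, confidence = rule_metrics("mug", "teapot")
print(f"mug -> teapot: support={support:.2f}, confidence={confidence:.2f}")
```

Support measures how often the pair occurs across all baskets, while confidence measures how often the consequent appears given the antecedent. Dedicated algorithms such as Apriori apply the same definitions to larger item sets.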


2) Retail Rocket Recommender System Dataset

This dataset consists of clickstream data from a real-world eCommerce website. It captures customer behavior such as clicks, add-to-cart events, and transactions, along with item properties for 417,053 unique items. The events data file records the event a user performs (add to cart, transaction, or view) for a product at a specific timestamp. The “transaction-id” column in the events data file has a value only if the user made a transaction; otherwise it is N/A.

Download Retail Rocket Recommender System Dataset for Machine Learning

Machine Learning Project Idea using Retail Rocket Machine Learning Dataset - Build a Recommender System to predict the transaction and event pattern of a visitor.

3) Instacart Orders Dataset for Machine Learning

With over 3 million grocery orders from over 200,000 anonymized Instacart customers, this is another interesting machine learning dataset for working with large retail data. For each customer, the dataset provides between 4 and 100 orders, including the sequence in which products were purchased and the week and hour of the day the orders were placed. XGBoost, Word2Vec, and Annoy are among the machine learning tools changing the way Instacart customers buy groceries today.

Download Instacart Orders Kaggle Dataset

Machine Learning/Data Science Project Ideas for Beginners using Instacart Dataset

  • Customer Segmentation – Build an association-based machine learning model to understand the diverse mix of Instacart customers and target the right customer segments to maximize profitability.
  • Market Basket Analysis – Develop a predictive market basket analysis machine learning model to identify which products an Instacart customer will purchase together again.

4) Brazilian e-Commerce Dataset by Olist

This machine learning dataset consists of data for 100K customer orders at Olist store with particulars on seller information, product metadata, customer information, and customer reviews.

Download Brazilian E-commerce Public Kaggle Dataset by Olist

Data Science/Machine Learning Project Ideas using Brazilian e-commerce Dataset

5) Supermarket Dataset for Machine Learning

With over 1000 rows and 17 columns, this retail dataset has 3 months of historical sales data from a supermarket company, recorded at three different branches. It is a good choice for predictive analytics projects.

Download Supermarket Kaggle Dataset for Machine Learning

Retail Image Datasets for Machine Learning

6) MVTec Densely Segmented Supermarket Image Dataset

With a limited amount of training data and high diversity in the validation and test sets, this is a challenging image dataset for machine learning to work with. It has 21K high-resolution images of everyday products and groceries acquired in 700 different scenes with pixel-wise labels of all object instances in industry-relevant settings with high-quality annotations.

Download MVTec D2S Retail Dataset for Machine Learning

Computer Vision Project Ideas using the MVTec D2S Dataset

This retail dataset can be used for semantic image segmentation to cover the real-world application of an automatic checkout, warehouse, or stock inventory system. Convolutional neural networks (CNNs) are a common choice for classifying the products in the images at the pixel level to simplify the checkout process.


7) Common Objects in Context (COCO) Dataset

With a total of 330K images (over 200K of them labeled), 91 stuff categories, 80 object categories, 1.5 million object instances, and 250,000 people with keypoints, the COCO dataset is one of the most popular and challenging high-quality datasets for computer vision. It contains images of diverse objects that we encounter in day-to-day life and is widely used as a starting point for transfer learning. Once a computer vision model has been trained on the COCO dataset, you can use a custom dataset to fine-tune the model for other tasks.

Download COCO Dataset for Machine Learning

What kind of Computer Vision projects can you work on using the COCO dataset?

Object Detection - Use the COCO dataset to perform one of the most challenging computer vision tasks of predicting where different objects are present in an image and what kind of objects are present.


8) Freiburg Groceries Dataset

The Freiburg Groceries retail dataset consists of 5000 images with 25 different classes of groceries with each class having a minimum of 97 images that have been captured in real-world settings at various departments of different grocery stores.

Download Freiburg Groceries Dataset

Computer Vision Project Ideas using Freiburg Groceries Dataset

You can build a computer vision model based on multi-class object classification for grocery products. This model can be further fine-tuned to build a frictionless store experience similar to the popular Amazon Go store, with no manual checkout needed.

9) Fashion MNIST Dataset

With 60K training examples, 10K testing examples, and 10 categories of retail products in 28x28 grayscale images, this is one of the best alternatives to the MNIST dataset used in deep learning and computer vision. It is designed as a drop-in replacement for MNIST, but it is slightly more challenging.

Download Fashion MNIST Kaggle Dataset

Computer Vision Project Ideas using Fashion MNIST Dataset

Use this dataset to enjoy your first taste of clothing classification by training a simple CNN with Keras or TensorFlow to build a model from scratch. You can look up this dataset if you want to practice a methodology for solving image classification problems using CNN machine learning algorithms.

10) Retail Product Checkout Dataset

With over 500,000 images of retail items on store shelves from 2000 different product categories - this is one of the largest retail image datasets both in terms of product categories and product image quantities.

Download a Large-Scale Retail Product Checkout Kaggle Dataset

Computer Vision Project Ideas using RPC Dataset

This dataset is widely used to advance research in retail product image recognition for automatic shelf auditing and checkout. The high-quality nature of this dataset makes it ideal for fine-grained retail product image classification.


Customer Reviews Retail Datasets for Machine Learning

11) Amazon Customer Reviews Dataset

With over 130 million customer reviews on millions of products from 1995 to 2015, this machine learning dataset is a boon for data scientists and researchers in the fields of machine learning, natural language processing, and information retrieval who want to understand customer experiences.

Download Amazon Customer Reviews Dataset

12) Women’s E-Commerce Clothing Reviews Dataset

This is an anonymized dataset because it contains reviews written by real customers. It has 23,486 customer reviews with 10 feature variables and provides a fantastic environment for parsing text in multiple dimensions.

Download Women’s E-Commerce Clothing Reviews Dataset

13) IKEA Reviews Dataset for Machine Learning

It’s a rather small machine learning dataset of 1300 best and worst IKEA customer reviews scraped from Google Maps. This makes for a perfect beginner level dataset for sentiment analysis.

Download IKEA Reviews Kaggle Dataset

14) Amazon and Best Buy Electronic Product Reviews Dataset

This dataset specifically has over 7000 online reviews for 50 electronic products available on Best Buy and Amazon. The dataset consists of date of review, title, rating, source, metadata, and other information.

Download Amazon and Best Buy Electronic Product Reviews Dataset

15) Multi-Domain Sentiment Dataset

This is a multi-domain dataset of 100K+ Amazon.com product reviews covering many product types, including musical instruments, books, and DVDs, each with a rating from 1 to 5.

Download Multi-Domain Sentiment Kaggle Dataset

Interesting Machine Learning Project Ideas using the Customer Review Datasets

  • Predict ratings based on the content of the customer reviews using NLP
  • Study the customer feedback impact on the product buying process. You can use these review datasets to predict the probability of a customer recommending the products to their friends.
  • Study the online reputation of various brands.
  • Perform sentiment analysis of the customer reviews to identify a user’s emotion towards a product – positive, negative or neutral. (Review Sentiments)
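As a first taste of the sentiment analysis idea above, the sketch below labels a review by counting words from two small word lists. The word lists, the function name `review_sentiment`, and the sample reviews are all invented for illustration; a real project would train a model on one of the review datasets instead.

```python
import re

# Tiny illustrative word lists; a real project would use a trained model
# or a much larger lexicon.
POSITIVE = {"good", "great", "love", "excellent", "comfortable"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "broken"}


def review_sentiment(text):
    """Label a review as positive, negative, or neutral by counting words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


sample_reviews = [
    "Great fit, I love it",
    "Arrived broken, terrible quality",
    "It is a shirt",
]
for review in sample_reviews:
    print(review, "->", review_sentiment(review))
```

A word-count baseline like this ignores negation and context, so it mainly serves as a reference point to beat with a learned model.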


Other Retail Datasets for Machine Learning

16) Innerwear Data from Victoria's Secret and Others

This dataset has data on 600K+ innerwear products from popular retail sites like Amazon, Victoria’s Secret, Hanky Panky, Macy’s, Btemptd, Nordstrom, American Eagle, and others.

Download Innerwear Data from Victoria's Secret and Others Kaggle Dataset

Machine Learning Project Idea Using Innerwear Kaggle Dataset:

This dataset can be used to analyze the fashion trends of swimwear and innerwear products.

17) eCommerce Item Data

This machine learning dataset contains 500 SKUs along with the descriptions of the products from an apparel brand’s product catalog.

Download eCommerce Item Kaggle Dataset

Machine Learning Project Idea Using Ecommerce Item Kaggle Dataset:

An interesting machine learning project you can work on using the Item data is to build a product recommender system.

18) eBay Online Auctions Dataset

This online auction retail dataset consists of auction information such as the bid rate, bid time, and auction price of the item for Swarovski beads, Cartier wristwatches, Xbox game consoles, and Palm Pilot M515 PDAs.

Download eBay Online Auctions Dataset

Machine Learning Project Idea Using Online Auctions Kaggle Dataset:

Build a machine learning model to predict the end price of an auction item. Predicting the end price of an auction item is beneficial to both buyers and sellers with a view of profit maximization.

19) Walmart Dataset

This is one of the best beginner-level machine learning datasets because it combines retail sales data with external data for the region of each Walmart store, such as the unemployment rate, fuel prices, and CPI, making it a good choice for detailed analysis. This Kaggle dataset contains anonymized historical sales data across 45 Walmart stores recorded from 2010 to 2012.

Download Walmart Store Sales Kaggle Dataset

Machine Learning/Data Science Project Ideas using Walmart Retail Dataset

Build a machine learning model to predict the department-wide sales of Walmart considering factors like Holiday and Markdown events, consumer price index, season changes, and other factors that impact the sales of the products. Sales forecasting models help companies sketch a plan on how to meet future demands and increase sales.

20) Men’s Shoe Price Dataset

This dataset consists of a large collection of 10,000 men’s shoes along with the prices at which they are sold, the brand name, shoe name, and other information.

Download Men’s Shoe Price Dataset

Machine Learning/Data Science Project Ideas using Shoe Price Dataset

Build machine learning models using this pricing data to -

  • Determine the brand value of a luxury brand
  • Determine pricing strategies
  • Identify trends for luxury men’s shoes
  • Identify the correlation between the specific features of a shoe and changes in price.
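The last idea in the list, measuring correlation between a shoe feature and its price, can be sketched with a hand-written Pearson correlation. The feature (`is_leather`) and the prices below are made-up toy values, not rows from the shoe price dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values: 1 if the shoe is leather, else 0, and its listed price.
is_leather = [1, 0, 1, 1, 0, 0]
price = [120.0, 45.0, 150.0, 110.0, 60.0, 40.0]

r = pearson(is_leather, price)
print(f"correlation between leather and price: {r:.2f}")
```

A value near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear relationship. Correlation alone does not show that the feature causes the price difference.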

Explore 54 Other Retail Datasets for Machine Learning

Best Healthcare Datasets for Machine Learning

Healthcare Datasets for Machine Learning

1) OSIC Pulmonary Fibrosis Progression

The Open Source Imaging Consortium (OSIC) healthcare dataset consists of 200 anonymized baseline CT scans of lungs along with associated clinical information such as baseline Forced Vital Capacity (FVC), the gender of the patients, age, the relative number of weeks since the baseline scan, and smoking status.

Download OSIC Pulmonary Fibrosis Progression Dataset

Data Science/Machine Learning Project Idea using OSIC Kaggle Dataset

You can build a machine learning model to predict a patient’s severity of the decline in lung function.

2) APTOS 2019 Blindness Detection

This is a diverse and expansive dataset of fundus photography retina images captured under various imaging conditions. Each image is clinically rated on a scale of 0 to 4 based on the severity of diabetic retinopathy.

Download APTOS 2019 Blindness Detection Kaggle Dataset

Machine Learning Project Idea using APTOS Dataset

About one-third of the 285 million people with diabetes show signs of diabetic retinopathy (DR). You can use this dataset to build a machine learning model that detects DR well before it causes complications in the eyes, which could help millions of people with diabetes avoid losing their vision.

3) Ultrasound Nerve Segmentation Dataset

This Kaggle dataset consists of 5635 images in which the nerves have been manually annotated by humans. It is a challenging machine learning dataset to work with because of its small size and lack of obvious structural features.

Download Ultrasound Nerve Segmentation Dataset

Access this Machine Learning Project with Source Code to build a machine learning model that identifies nerve structures in Ultrasound images to segment a collection of nerves known as Brachial Plexus (BP).

4) Parkinson Dataset

This is a very small healthcare dataset (approximately 39 KB) with a range of biomedical voice measurements from 31 people, of whom 23 have Parkinson’s disease.

Download Parkinson Dataset from UCI Machine Learning Repository

Machine Learning Project Idea using Parkinson Dataset

Over 1 million people in India are affected by Parkinson’s disease every year. The disease is chronic, has no cure, and is difficult for doctors to diagnose at an early stage. You can build a machine learning model to detect the early onset of Parkinson’s disease and classify whether an individual is healthy or has the disease based on several factors.

5) Intel & MobileODT Cervical Cancer Dataset

This Kaggle dataset consists of 1481 training images and 512 test images. Considering the limited nature of this dataset, you might have to apply various data augmentation techniques to increase the number of training samples.
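Data augmentation can be illustrated with a few geometric transforms on a tiny image stored as a list of lists. This is a simplified sketch using plain Python; the helper names and the toy 2x3 "image" are assumptions for the example, and real projects typically rely on an image library for the same operations.

```python
def flip_horizontal(image):
    """Mirror each row of a 2D image (list of lists)."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Reverse the order of the rows."""
    return image[::-1]


def rotate_90_clockwise(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def augment(image):
    """Return the original image plus three transformed copies."""
    return [
        image,
        flip_horizontal(image),
        flip_vertical(image),
        rotate_90_clockwise(image),
    ]


# A 2x3 toy "image" of pixel intensities stands in for a real photo.
toy_image = [[1, 2, 3],
             [4, 5, 6]]
augmented = augment(toy_image)
print(len(augmented), "samples from 1 original")
```

Whether a given transform is appropriate depends on the task: it should only be applied when the label of the image stays valid after the transformation.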

Download Intel & MobileODT Cervical Cancer Dataset

Deep Learning Project Ideas using Intel and Mobile ODT Cervical Cancer Dataset

Cervix Type Classification using Deep Learning and Image Classification - Cervical cancer is deadly, but early detection followed by appropriate treatment can be a lifesaver for many women. You can use this Kaggle dataset to build a deep learning model to classify cervix types (Type 1, Type 2, and Type 3) to help healthcare professionals provide better care to women across the globe. Classifying the cervix types will help healthcare providers enhance the efficiency and quality of cervical cancer screening for women.

6) Breast Histopathology Images Dataset

The actual dataset consists of 162 slide images of breast cancer specimens. From this dataset, 277,524 patches have been extracted, of which 78,786 belong to the positive class while the remaining 198,738 patches belong to the negative class.
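Because the two classes are unevenly represented, one common way to compensate during training is to weight each class inversely to its frequency. The sketch below applies that heuristic to the patch counts quoted above; the label names `idc_positive` and `idc_negative` are chosen here for illustration, and this is one possible weighting scheme rather than a prescribed one.

```python
# Patch counts taken from the dataset description above.
class_counts = {"idc_positive": 78_786, "idc_negative": 198_738}

total = sum(class_counts.values())
n_classes = len(class_counts)

# Inverse-frequency weighting: rarer classes get proportionally larger weights.
class_weights = {
    label: total / (n_classes * count) for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    share = class_counts[label] / total
    print(f"{label}: {share:.1%} of patches, weight {weight:.3f}")
```

With these weights, each class contributes the same total weight to the loss, so the model is not rewarded for simply favoring the majority class.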

Download Breast Histopathology Images Dataset

Deep Learning Project Ideas using Breast Histopathology Image Dataset

Breast cancer is the most common type of cancer with 627,000 death reports among 2.1 million diagnosed breast cancer cases in 2018. 80% of all the diagnosed breast cancer cases fall under the invasive ductal carcinoma (IDC) type of breast cancer. Early accurate diagnosis of cancer helps choose the right treatment plan and helps increase the survival rate of cancer patients. You can use this dataset to build a deep CNN for image classification to identify the presence of IDC in unlabelled histopathology images. This is an important clinical task and an automated model for this will definitely save time and reduce error.

7) Mini DDSM Dataset

One of the largest (45 GB) public mammography datasets, it includes the age attribute, density attribute, patient’s original filename, cancer lesion contour binary mask image, and an Excel sheet with all the metadata needed.

Download Mini DDSM Kaggle Dataset

Machine Learning Project using Mini DDSM Dataset

Age estimation has diverse clinical applications and several studies have been conducted on human age estimation using biomedical images. Using this dataset, you can build an AI-based model for estimating age based on the pectoral muscle segments in the mammogram images. The foremost step is to segment the pectoral muscles from the mammogram images and then extract deep learning features to build a model for age estimation.

8) Cleveland Heart Disease Dataset

The Cleveland Heart Disease UCI dataset consists of data for 303 individuals with 75 attributes, of which 14 are typically used. These include Age, Gender, Resting Blood Pressure, Serum Cholesterol, Resting ECG, Max Heart Rate achieved, Exercise-induced angina, and other important parameters that could be major risk factors in developing cardiovascular disease.

Download Heart Disease Dataset

Machine Learning Project Idea using Heart Disease Dataset

Heart disease is a major cause of mortality and morbidity worldwide, with 610,000 deaths reported every year in the US alone. The odds of developing cardiovascular disease are difficult to determine manually from the risk factors. This is where machine learning can be of great help in making predictions from the large quantity of data produced by the healthcare industry. You can apply various machine learning algorithms such as SVM, Naïve Bayes, XGBoost, Decision Trees, and Random Forests and compare them for predicting whether a person is suffering from heart disease using the Cleveland heart disease machine learning dataset.


9) Mechanisms of Action Prediction Dataset

This is a unique machine learning dataset that consists of cell viability data and gene expression data, with access to MoA annotations for over 5K drugs. It is based on a novel technology that measures the response of human cells to drugs in a pool of 100 diverse cell types, thereby eliminating the problem of identifying which cell types are better suited for a given drug.

Download Mechanisms of Action (MoA) Prediction Kaggle Dataset

Machine Learning Project Idea using MoA Prediction Dataset

Drug discovery plays a vital role in the advancement of disease treatment. Machine learning is being extensively used to understand the underlying mechanisms of disease, clinical markers, drug discovery, and validation. This dataset can be used to advance drug development by developing machine learning algorithms to classify drugs based on their biological activity.

10) WHO - A World of Healthcare Machine Learning Datasets

WHO is one of the most trustworthy and authentic sources of healthcare data for different nations. With data and analysis ranging from COVID-19 to specific diseases like Cholera, Tuberculosis, and Influenza, WHO has data on global health priorities and trend highlights for most health conditions.

Download Healthcare Datasets for Machine Learning from WHO Repository


Other Cool and Interesting Machine Learning Project Ideas to work with Healthcare Data

  • Lung Segmentation
  • Diabetes Prediction
  • Contact Tracing to stop the spread of infectious disease
  • Cancer Classification
  • Personalized Medicine
  • Predict Chronic Diseases
  • Predict disease outbreaks
  • Classify image data (X-rays, CT scans, etc.) for diagnostic care.

Check Out 299 other healthcare datasets for machine learning

Best Banking and Finance Machine Learning Datasets

Banking and Finance Datasets for Machine Learning

1) Santander Datasets

As this is a banking dataset, it has been completely masked and contains only numerical values. Santander, a bank based in Spain, has provided four different datasets to help solve various business challenges using machine learning.

Download Santander Customer Transaction Dataset

Download Santander Value Prediction Dataset

Download Santander Product Recommendation Dataset

Download Santander Customer Satisfaction

Ideas for Machine Learning Project using Santander Transaction Dataset

These Santander Bank datasets can be used to build end-to-end machine learning models to -

  • Predict if a customer will make a transaction with the bank in the future regardless of the amount of money transacted.
  • Predict if a customer will buy a product
  • Predict if a customer is capable of paying the loan
  • Predict if a customer is satisfied with the services of the bank.

2) Home Credit Default Risk Dataset

This dataset consists of 7 different sources of data for a customer: loan application data, bureau data, credit card balance data, previous loan applications data, POS cash balance data, EMI payments data, and bureau balance data.

Download Home Credit Default Risk Kaggle Dataset

Machine Learning Project Ideas using Home Credit Default Risk Kaggle Dataset

Build a machine learning model to predict if a customer is capable of repaying a loan or not. These models will help banks take a decision on sanctioning loans only to those applicants who are capable of repaying the loan.

3) Bank Turnover Dataset

This dataset contains 14 features for about 10K customers of a bank, of which 20% are churned customers.

Download Bank Turnover Dataset

Machine Learning Project using Bank Turnover Dataset

This dataset can be used for predicting customer churn, one of the most common applications of machine learning. You can build a machine learning model to predict if a customer will quit the services of the bank in the next 6 months or not. Predicting customer churn will help banks develop retention campaigns and loyalty programs to retain customers.


4) Credit Card Transactions Dataset

This European credit card dataset consists of 284,807 transactions, of which 492 are fraudulent (0.172% of all transactions), that occurred over a period of two days in September 2013. It is a challenging dataset to work with because it is highly imbalanced: most of the transactions are not fraudulent, which makes it difficult to detect the fraudulent ones.
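The effect of this imbalance on evaluation can be shown with simple arithmetic. The sketch below scores a hypothetical "model" that labels every transaction as legitimate, using only the counts quoted above; no real classifier or data file is involved.

```python
# Counts taken from the dataset description above.
total_transactions = 284_807
fraud_transactions = 492

# A "model" that labels every transaction as legitimate.
true_negatives = total_transactions - fraud_transactions
false_negatives = fraud_transactions
true_positives = 0

accuracy = (true_positives + true_negatives) / total_transactions
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy: {accuracy:.4%}")
print(f"recall on fraud: {recall:.0%}")
```

The accuracy is very high even though not a single fraud is caught, which is why metrics such as recall, precision, or the area under the precision-recall curve are usually more informative than accuracy on data like this.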

Download Credit Card Fraud Transaction Kaggle Dataset

Machine Learning Project using Credit Card Transactions Dataset

Credit card fraud is a common problem for many banks and credit card companies. Most fraudulent transactions look similar to normal ones, and the huge number of credit card transactions completed each day makes it difficult to detect fraud manually. Use this finance machine learning dataset to identify fraudulent credit card transactions to make sure customers are not charged for transactions they did not make.

5) Give me Some Credit Dataset

This dataset consists of historical data created in 2008 for 250K Brazilian borrowers that can be leveraged by a financial institution to predict the credit score and make the best financial decisions.

Download Give me Some Credit Kaggle Dataset

Machine Learning Project Ideas using Give me Some Credit Dataset

Build a machine learning model to predict the probability that a person will experience financial distress in the next two years.

6) Two Sigma Dataset

This dataset consists of two data sources namely Intrinio and Thomson Reuters.  The training market data provided by Intrinio has approximately 4 million rows while the training news analytics data provided by Reuters has close to 9 million rows making it one of the largest datasets to work with for stock price prediction.

Download Two Sigma Dataset

Interesting Machine Learning Project Ideas using Two Sigma Kaggle Dataset

Stock prices are usually determined by the behavior of investors, who in turn use public information to predict how the stock market will react. This is where finance news articles play a vital role in influencing the prices of stocks as investors react to this information. This dataset can be used to build a machine learning model that categorizes the news articles related to a list of companies and predicts the movement of stock prices for those companies based on that.


7) Bitcoin Historical Dataset

This dataset consists of data from select bitcoin exchanges from January 2012 to December 2020, with minute-by-minute updates on Open, High, Low, and Close prices along with the weighted bitcoin price and the volume in BTC and the indicated currency.

Download Bitcoin Historical Dataset

Sample Machine Learning Project Idea using Bitcoin Historical Dataset

Use this Kaggle dataset to build a machine learning model that predicts tomorrow's Bitcoin price. One option to explore is an LSTM model for predicting Bitcoin prices.
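Sequence models such as an LSTM are typically trained on fixed-length windows of past prices paired with the next value. The sketch below shows that windowing step plus a simple persistence baseline ("tomorrow equals today") that any model should beat. It assumes the minute-level data has already been resampled to daily closes; the prices are made up, and `make_windows` is a hypothetical helper.

```python
def make_windows(series, window):
    """Split a series into (inputs, target) pairs for next-step prediction."""
    pairs = []
    for i in range(len(series) - window):
        pairs.append((series[i:i + window], series[i + window]))
    return pairs

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up daily closing prices, for illustration only.
closes = [100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 110.0]
pairs = make_windows(closes, window=3)

# Persistence baseline: predict that tomorrow's close equals today's close.
targets = [target for _, target in pairs]
baseline_preds = [inputs[-1] for inputs, _ in pairs]
baseline_mae = mean_absolute_error(targets, baseline_preds)

print(f"{len(pairs)} training windows, baseline MAE = {baseline_mae}")
```

Comparing a trained model's error against `baseline_mae` is a quick sanity check on whether the model has learned anything beyond "the price rarely moves far in one day".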

8) Jane Street Market Dataset

If you enjoy machine learning projects or want to explore good stock market data, this dataset is a golden opportunity. It contains real stock market data with anonymized features, where each row in the dataset represents a trading opportunity.

Download Jane Street Market Prediction Dataset

Suggested Machine Learning Project using Jane Street Market Prediction Dataset

Use the Jane Street stock market data to build a quantitative trading machine learning model to maximize returns using real stock market data from the global stock exchange. You can also test the effectiveness of your machine learning model against future real stock market data.

9) Elo Merchant Category Recommendation

Elo is a large Brazilian payment brand that provides restaurant recommendations, with discounts, to its debit and credit card users based on their preferences. This dataset contains information about every card transaction: up to 3 months of transaction values for each card at a specific merchant, details of transactions at new merchants for each card, and other data on the merchants involved in the card transactions.

Download Elo Merchant Category Recommendation Dataset

Suggested Machine Learning Project for Elo Merchant Category Dataset

This dataset can be used to find out how beneficial these promotions are for customers as well as merchants. Build a machine learning model to predict the loyalty score of customers and help Elo understand customer loyalty so it can cut down on unwanted marketing campaigns and create the right experience for its users.

10) Sberbank Russian Housing Market Dataset

The training data for this dataset has information on 21K realty transactions from Sberbank, Russia’s oldest and largest bank, while the test data has 7K realty transactions, along with other information on each property.

Download Sberbank Russian Housing Market Kaggle Dataset

Machine Learning Project Idea using Sberbank Russian Housing Market Dataset

Use this rich banking dataset to develop a machine learning model that predicts realty prices so that developers, lenders, and renters have confidence when purchasing a property or signing a lease. The data also includes information on Russia’s economic and financial sector that can help in developing an accurate model without the need for second-guessing.

Explore hundreds of other premier financial and economic datasets.

Social Media Datasets for Machine Learning

1) Twitter US Airline Sentiment Dataset

This social media dataset has 14,640 rows and 12 attributes and consists of Tweets scraped from Twitter for each major US airline.

Download Twitter US Airline Sentiment Dataset

Suggested ML Project Idea: Sentiment Classification System Using Machine Learning

You can use this dataset to classify airline tweets as positive, negative, or neutral to analyze travelers' feedback about the airline.
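To make the three-way classification idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The six training tweets are invented for illustration and are not taken from the dataset; a real project would train on the labeled tweets and would likely use a library such as scikit-learn.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train_naive_bayes(examples):
    """examples: list of (text, label). Returns the counts needed to predict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented example tweets (assumption), two per sentiment class.
train = [
    ("great flight friendly crew", "positive"),
    ("love the service thank you", "positive"),
    ("flight delayed again terrible service", "negative"),
    ("lost my bag worst airline", "negative"),
    ("what time does the flight board", "neutral"),
    ("can i change my seat", "neutral"),
]
model = train_naive_bayes(train)
print(predict(model, "terrible delayed flight"))
print(predict(model, "friendly crew great service"))
```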

2) Google Cloud and YouTube 8M Dataset

A dataset developed by Google AI/Research in 2016 with 8 million YouTube videos (a total of 500K hours) and a vocabulary of 4.8K visual labels (an average of 3.4 labels per video).

Download YouTube 8M Dataset

Data Science and Machine Learning Project Ideas using YouTube 8M Dataset

  • Build a compact video classification model, less than 1GB in size, that learns video representations. This will help advance video-level annotation.
  • Build a classification machine learning model to accurately assign video labels.

3) COVID-19 Tweets Dataset

This is a multi-language tweets dataset of over 1 billion tweets containing keywords like coronavirus, virus, covid, ncov19, ncov2019 with hashtags, mentions, topics and other information.

Download COVID19 Tweets Dataset

Suggested ML Projects using COVID 19 Dataset

Use data mining, network analysis, and NLP to analyze a corpus of tweets from this dataset to identify the response of people to the pandemic and how the responses differ with time. You can also leverage this ML dataset to glean insights on how the right information and misinformation were transmitted during the early days of this pandemic.

4) Yelp Dataset

This dataset consists of 5,200,000 reviews with information on 174,000 businesses across 11 areas in 4 countries.

Download Yelp Kaggle Dataset

What projects can you do with this dataset for machine learning?

Use NLP and sentiment analysis to determine whether a review is positive or negative and to infer meaning from the sentiments and business attributes.

5) Customer Support on Twitter

A dataset of 3 million tweets from the top brands on Twitter.

Download Customer Support on Twitter Dataset

What projects can I do using this ML dataset?

Build machine learning models to:

  • Understand how tone affects a customer support conversation, and whether an apology or saying sorry helps.
  • Measure how quickly the top companies respond to customer requests compared with the worst.
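For the response-time question, a reasonable first step before any modeling is to compute the median reply delay per brand. The sketch below assumes each exchange has already been reduced to a (brand, customer tweet time, brand reply time) tuple; the brand names, timestamps, and timestamp format are invented for illustration and do not reflect the dataset's actual schema.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

FMT = "%Y-%m-%d %H:%M"  # assumed timestamp format

def median_response_minutes(exchanges):
    """exchanges: (brand, customer_tweet_time, brand_reply_time) tuples."""
    delays = defaultdict(list)
    for brand, asked_at, replied_at in exchanges:
        asked = datetime.strptime(asked_at, FMT)
        replied = datetime.strptime(replied_at, FMT)
        delays[brand].append((replied - asked).total_seconds() / 60)
    return {brand: median(values) for brand, values in delays.items()}

# Invented exchanges for two hypothetical brands.
exchanges = [
    ("BrandA", "2017-10-01 09:00", "2017-10-01 09:12"),
    ("BrandA", "2017-10-01 10:00", "2017-10-01 10:08"),
    ("BrandA", "2017-10-01 11:00", "2017-10-01 11:40"),
    ("BrandB", "2017-10-01 09:00", "2017-10-01 11:00"),
    ("BrandB", "2017-10-01 12:00", "2017-10-01 13:00"),
]
medians = median_response_minutes(exchanges)
for brand, minutes in sorted(medians.items(), key=lambda item: item[1]):
    print(f"{brand}: median reply after {minutes:.0f} minutes")
```

The median is used here rather than the mean so that a few very slow replies do not dominate a brand's score.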

Crime Datasets for Machine Learning

1) San Francisco Crime Classification

This is a historical dataset that consists of 12 years of crime reports, from 2003 to 2015, from San Francisco. The data includes the day of the week the crime occurred, the time of the crime, a description of the crime, the district, address, location coordinates, and resolution.

Download San Francisco Crime Classification Dataset

ML Project Idea using the Crime Classification Kaggle Dataset

Build an end-to-end machine learning model to predict the category of crime events based on the location and time of occurrence of the event.
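Since the prediction relies on when and where a crime occurred, a common first step is to turn each raw timestamp into numeric features a classifier can use. The sketch below is one way to do that with the standard library; the timestamp format and the feature names are assumptions made for this example.

```python
from datetime import datetime

def time_features(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Derive simple time-of-occurrence features from a timestamp string."""
    dt = datetime.strptime(timestamp, fmt)
    return {
        "hour": dt.hour,
        "day_of_week": dt.weekday(),  # Monday = 0, Sunday = 6
        "month": dt.month,
        "is_weekend": dt.weekday() >= 5,
        "is_night": dt.hour >= 22 or dt.hour < 6,
    }

# Illustrative timestamp in the assumed format.
features = time_features("2015-05-13 23:53:00")
print(features)
```

These features can then be combined with the district and location coordinates as inputs to a multi-class classifier.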

2) London Crime Dataset

This dataset consists of crime reports from January 2008 to December 2016, broken down by LSOA, borough, month, and minor/major category, with 13 million rows of crime counts.

Download London Crime Dataset

Suggested Projects using the London Crime Kaggle Dataset

This data can be used to analyze if there are any changes in crime occurrences based on the day of the week or season or identify boroughs where specific crimes are decreasing or increasing.
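One simple way to spot boroughs where crime is rising or falling is to total the counts per borough and year and compare the first and last years. The sketch below does this with invented rows; the borough names, counts, and the (borough, year, month, count) row layout are assumptions for illustration.

```python
from collections import defaultdict

def yearly_totals(records):
    """Sum crime counts per borough and year from (borough, year, month, count) rows."""
    totals = defaultdict(lambda: defaultdict(int))
    for borough, year, _month, count in records:
        totals[borough][year] += count
    return totals

def trend(totals_by_year):
    """Compare the first and last year on record for one borough."""
    years = sorted(totals_by_year)
    change = totals_by_year[years[-1]] - totals_by_year[years[0]]
    if change > 0:
        return "increasing"
    if change < 0:
        return "decreasing"
    return "flat"

# Invented rows for illustration: (borough, year, month, count).
records = [
    ("Borough A", 2008, 1, 120), ("Borough A", 2008, 2, 110),
    ("Borough A", 2016, 1, 90), ("Borough A", 2016, 2, 95),
    ("Borough B", 2008, 1, 40), ("Borough B", 2008, 2, 45),
    ("Borough B", 2016, 1, 70), ("Borough B", 2016, 2, 65),
]
totals = yearly_totals(records)
trends = {borough: trend(by_year) for borough, by_year in totals.items()}
print(trends)
```

Comparing only two endpoints is crude; fitting a line through every year's total would be a natural refinement.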

3) Crime In India

This dataset consists of complete information on state-wise crime data from 2001 classified across 40+ factors.

Download Crime in India Dataset

Suggested Projects for Analytics using this Dataset

This dataset can be used to analyze crime patterns in India, such as child abuse cases, crimes against SCs and STs, and other crimes, and to detect potential criminals based on those patterns.

4) Chicago Crime Dataset

The Chicago Crime Dataset that comes from the Chicago Police Department has 6.99 million rows with 22 attributes. This dataset is updated continuously with incidents of crime.

Download Chicago Crime Dataset

Machine Learning Project Ideas using Chicago Crime Dataset

This dataset can be leveraged to build models for analyzing the effect of temperature on violent crimes like battery or assault, identifying the category of crimes that saw the highest year-on-year increase, etc.

5) Crime in Boston Dataset

The dataset, provided by the Boston Police Department, contains information, starting from June 2015, on the type of crime, when and where the crime occurred, the crime description, location coordinates, and other details.

Download Crime in Boston Dataset

This dataset can be used to build a model that identifies the criminal hotspots and the frequent occurrence time of the crimes.
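A simple baseline for hotspot detection, before any model is trained, is to bucket incident coordinates into grid cells and count incidents per cell and per hour of day. The sketch below assumes each incident has been reduced to (latitude, longitude, hour); the coordinates are invented, and the 0.01-degree cell size is an arbitrary choice.

```python
import math
from collections import Counter

def grid_cell(lat, lon, cell_size=0.01):
    """Map a coordinate to a grid cell; 0.01 degrees is roughly 1 km of latitude."""
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))

# Invented incidents for illustration: (latitude, longitude, hour of day).
incidents = [
    (42.3551, -71.0603, 17),
    (42.3558, -71.0609, 18),
    (42.3554, -71.0601, 17),
    (42.3312, -71.0953, 2),
    (42.2871, -71.0710, 17),
]

cell_counts = Counter(grid_cell(lat, lon) for lat, lon, _ in incidents)
hour_counts = Counter(hour for _, _, hour in incidents)

top_cell, top_cell_count = cell_counts.most_common(1)[0]
peak_hour, peak_hour_count = hour_counts.most_common(1)[0]
print(f"busiest cell {top_cell} with {top_cell_count} incidents")
print(f"peak hour {peak_hour}:00 with {peak_hour_count} incidents")
```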

Still craving datasets for your data science and machine learning projects? Stay tuned to this page for updates on many more interesting machine learning datasets.

Next Steps

What do you do once you’ve handpicked some machine learning datasets? Identify a use case with a proven ROI and start working with the dataset. Once you have access to a workable machine learning dataset, there are certain steps you’ll need to follow, beginning with exploratory data analysis. Turning these datasets into production-ready machine learning models can seem daunting, but ProjectPro’s industry expertise promises to make this process easier. We host 50+ data science and machine learning projects that help data specialists easily deploy and manage machine learning models in production. At ProjectPro, we can help you learn how to build end-to-end data science and machine learning projects using these datasets, from data preparation, data cleaning, and data exploration through model training, model evaluation, and testing to data visualization. See how ProjectPro can help you build better machine learning models for your business use case.

About the Author

ProjectPro

ProjectPro is the only online platform designed to help professionals gain practical, hands-on experience in big data, data engineering, data science, and machine learning related technologies. Having over 270+ reusable project templates in data science and big data with step-by-step walkthroughs,
