Top 25 Twitter Training Datasets for Data Scientists (Free)

Sep 26, 2023

The social media landscape has changed significantly in recent times, most visibly with Twitter’s rebranding as “X” following Elon Musk’s acquisition. As the industry buzzes with rumors about restricted data access and API changes, AI and ML professionals are on the hunt for reliable sources of Twitter training datasets to fuel their data-hungry models.

Fortunately, many open-source Twitter (now “X”) training datasets have been collected and shared over the past decade or so. We have gathered our favourites so you don’t have to search the internet yourself. You can also check out our list of the 12 best free natural language datasets.

Types of Twitter Training Datasets

Before we explore the datasets, let’s categorize them based on their content and utility:

COVID-19 Pandemic Tweets: As the pandemic unfolded, Twitter became a hub for discussion. The dataset in this category comprises over 150 million tweets spanning multiple languages.
General Twitter Datasets of Random Tweets: Random collections of tweets sourced from Twitter, providing diverse text data for various NLP tasks.
User-Only Features vs. Tweet-Only Features: Some datasets focus on user-related data, such as usernames, while others emphasize tweet content.
Relevant Tweets Extracted via Logistic Regression Techniques: Datasets in which classification techniques such as logistic regression were used to extract tweets relevant to a specific topic.
Feature Sets for Model Training and Deep Learning: Datasets designed to serve as feature sets for training machine learning models.
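
To make the user-only versus tweet-only distinction concrete, here is a minimal Python sketch that splits one flat record into the two feature groups. The field names (`username`, `followers_count`, `text`, and so on) are hypothetical placeholders, not the schema of any dataset listed in this article, so adapt them to the columns your dataset actually provides.

```python
# Hypothetical field names for illustration only; real datasets
# define their own columns.
USER_FIELDS = {"username", "followers_count", "friends_count", "location"}
TWEET_FIELDS = {"text", "created_at", "retweet_count", "lang"}


def split_features(record):
    """Split one flat record into (user_features, tweet_features)."""
    user = {k: v for k, v in record.items() if k in USER_FIELDS}
    tweet = {k: v for k, v in record.items() if k in TWEET_FIELDS}
    return user, tweet


# A made-up record used purely to demonstrate the split.
record = {
    "username": "example_user",
    "followers_count": 120,
    "text": "Loving the new update!",
    "lang": "en",
}

user_features, tweet_features = split_features(record)
```

Keeping the two groups separate makes it easier to train a model on tweet text alone, or to drop user-identifying fields before sharing data.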

Top 25 Twitter Datasets

Now, let’s explore the top 25 Twitter datasets that are invaluable for sentiment analysis, content moderation, and various other AI applications:

16 Million Unfiltered Tweets: A compilation of 16 million tweets from January 23rd to February 8th, 2011, including both important and spam tweets.
2016 Presidential Election: Originally compiled for transparency during the 2016 presidential election, this dataset focuses on related tweets.
Apple Twitter Sentiment: This dataset contains tweets related to Apple that include the #AAPL hashtag or @apple references, with each tweet classified as Positive, Negative, or Neutral.
Avengers Endgame Tweets: This dataset includes over 10,000 records related to the hit film “#AvengersEndgame” from 2019.
Charlottesville on Twitter: A collection of 150,000 tweets related to the Unite the Right rally in Charlottesville.
COVID-19 Tweets: This Twitter dataset contains 150+ million tweets related to the COVID-19 global pandemic, covering multiple languages with a focus on English, Spanish, and French.
Credibility Corpus in French and English: Designed to detect misinformation, this dataset consists of French and English tweets related to rumors.
Customer Support on Twitter: This extensive dataset includes customer service interactions on Twitter, along with corresponding tweets and replies.
Elon Musk’s tweets dataset: This dataset is a collection of Elon Musk’s tweets from 2010-06-04 to 2017-04-05.
Every Donald Trump Tweet: A compilation of all tweets posted by Donald Trump, accessible from thetrumparchive.com.
Game of Thrones Season 8 Tweets: A collection of tweets reflecting Twitter feedback after each episode of Game of Thrones Season 8.
Gender Classifier Twitter Dataset: This dataset was used to train a CrowdFlower AI gender predictor. Contributors were asked to view a Twitter profile and judge whether the user was male, female, or a brand (non-individual).
MovieTweetings dataset: This dataset consists of ratings on movies that were contained in well-structured tweets on Twitter. This dataset is the result of research conducted by Simon Dooms.
Pre-processed Twitter Tweets: These tweets have been categorized into positive, negative, and neutral for sentiment analysis.
Sentiment140: This dataset is useful for analyzing sentiment around specific topics, brands, or products on Twitter.
SMILE Twitter Emotion: Ideal for sentiment analysis, this dataset contains over 3,000 tweets expressing various emotions.
Stanford SNAP Twitter Dataset: With over 476 million tweets from 20 million users spanning a 7-month period, this dataset comes from the SNAP library database at Stanford University.
Top 20 Most-Followed Users on Twitter: Comprising 52,000 tweets from the top 20 Twitter profiles, excluding retweets.
TweetEval Sentiment Classification Dataset: This benchmark consists of seven heterogeneous Twitter tasks, all framed as multi-class tweet classification: irony, hate, offensive, stance, emoji, emotion, and sentiment. All tasks have been unified into the same benchmark, with each dataset presented in the same format and with fixed training, validation, and test splits.
Twitter Airline Sentiment: Focused on tweets related to major US airlines, categorized into positive, neutral, and negative sentiment.
Twitter Friends: This dataset provides information about avatars, friend counts, User IDs, follower counts, and more.
Twitter News Dataset: Covers 5,234 news events and their corresponding tweets.
Twitter User Data: Featuring 20,000 rows, each containing a username, a random tweet, account profile, and image/location information.
UMass Global English on Twitter Dataset: Comprising 10,000+ tweets, this dataset is randomly sampled from geotagged Twitter messages and annotated based on their language.
VoterFraud 2020 Dataset: This dataset centers on rumors about voter fraud during the 2020 presidential election, containing 7.6 million tweets and nearly 26 million retweets.

Accessing These Datasets

Whether you’re working on sentiment analysis, content moderation, or any other NLP task, these Twitter training datasets provide a wealth of data to train and fine-tune your machine learning models.
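
Whichever dataset you choose, raw tweets usually need light cleaning before training. The following is a minimal sketch, not a prescribed pipeline: it assumes you want to replace URLs and @-mentions with placeholder tokens and drop near-duplicate tweets. The `<url>` and `<user>` tokens and the helper names `normalize_tweet` and `dedupe` are illustrative choices, not part of any dataset or library mentioned here.

```python
import re

# Simple patterns; they are deliberately loose and will, for example,
# also treat the "@domain" part of an email address as a mention.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_tweet(text):
    """Replace URLs and mentions with placeholders and collapse whitespace."""
    text = URL_RE.sub("<url>", text)
    text = MENTION_RE.sub("<user>", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def dedupe(tweets):
    """Keep the first tweet of each group that normalizes to the same text."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = normalize_tweet(tweet).lower()
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique


cleaned = normalize_tweet("@united  thanks! see https://t.co/abc123 ")
unique_tweets = dedupe(["Hello @a", "hello @b", "Bye"])
```

Replacing mentions with a placeholder also reduces the amount of user-identifying text your model sees, which fits the ethical-practice concerns raised at the end of this article.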

Looking for more Twitter (now X) datasets? Try these repositories:
Archive.org Twitter Datasets: A collection of free Twitter datasets compiled for study and research. The archive holds a large volume of data, organized into streams that you can choose from and sort.
Documenting the Now Tweet Catalog: An archive of public Twitter data by Documenting the Now, featuring topics like elections, protests, and natural disasters.
Kaggle Twitter Dataset Database: A hub for data science enthusiasts and scholars with an extensive collection of shared datasets.
TweetSets (GWU): Public datasets by GWU, focusing primarily on US politics and current events.
Zenodo Twitter Dataset Database: A repository hosting diverse data and scholarly works, including Twitter datasets contributed by individual researchers.

In the ever-changing world of social media, Twitter training datasets serve as stable pillars for AI and ML enthusiasts. With a diverse range of datasets available, you can power your projects with real-world Twitter data, helping make your models robust, accurate, and ready for the challenges of the digital age. You can also check out our list of 65 of the best datasets for machine learning.

Stay ahead in the AI and ML game by harnessing the knowledge contained within these datasets, and keep an eye on ethical practices in data annotation to ensure the highest quality in your training data.