Twitter Bot Identification

In this article I will show how various classification algorithms can be used to identify whether a Twitter account is a bot.

Gagan Talreja
3 min read · Sep 29, 2019

What is a Twitter Bot?

Bots on Twitter are semi-automated or fully automated programs that use the normal features of Twitter, such as tweeting, retweeting, and posting content. Unlike other social media sites such as LinkedIn, Twitter permits the use of bots on its platform. There are many different types of Twitter bots, and their purposes vary: some bots tweet helpful material, while others are used to spread misinformation or propaganda that can influence the mindset of people. In total, Twitter bots are estimated to create approximately 24% of the tweets on Twitter.

Identification of a Twitter Bot

Detecting non-human Twitter users has long been of interest to academics. Indiana University has developed a free service called Botometer (formerly BotOrNot), which scores Twitter handles based on their likelihood of being a Twitter bot. One significant academic study estimated that up to 15% of Twitter users were automated bot accounts. The prevalence of Twitter bots, coupled with the ability of some bots to give seemingly human responses, has enabled these non-human accounts to garner widespread influence.

Intrinsically, the difficulty of identifying bots on a social media platform like Twitter comes from the fact that there is no way of knowing for certain what a bot looks like. Unlike academic datasets, these accounts come with no ground truth or labels.

I decided to use machine-learning-based classification models to answer the question: given some data for a Twitter handle, can we classify it as a bot? I obtained an annotated dataset from Kaggle that contained the follower count, listed count, friends count, tweet description and some other information, which I used to classify accounts as bots. For this purpose, I used two classification approaches: user based and tweet based.
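Loading such a dataset is straightforward with pandas. The sketch below uses a tiny inline sample in place of the Kaggle file, and the column names (`followers_count`, `friends_count`, and so on) are assumptions modeled on common Twitter bot datasets, not the exact schema the article used:

```python
import pandas as pd

# Tiny inline sample standing in for the annotated Kaggle dataset.
# Column names here are hypothetical -- the real file may differ.
df = pd.DataFrame({
    "followers_count":  [10, 50000, 3],
    "friends_count":    [2000, 300, 5000],
    "listed_count":     [0, 120, 0],
    "favourites_count": [1, 900, 0],
    "statuses_count":   [90000, 4000, 120000],
    "bot":              [1, 0, 1],   # 1 = bot, 0 = human (the annotation)
})

# Numeric account-level features go to the classifier; "bot" is the label.
feature_cols = ["followers_count", "friends_count", "listed_count",
                "favourites_count", "statuses_count"]
X = df[feature_cols]
y = df["bot"]
```

With a real file you would replace the inline frame with `pd.read_csv(...)` and the same column selection.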

User Based Classification

This classifier is intended to judge whether a user is a Twitter bot based on the attributes of the account itself. It takes the following measurements: tweet count, follower count, favorite count, friends count and listed count. Obviously, these parameters are limited and more work is needed to expand them. I implemented three classification techniques: CART Decision Tree, Random Forest and Logistic Regression. For each technique, I calculated the precision, recall and accuracy scores and plotted the confusion matrix.
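A minimal sketch of this three-model comparison, using scikit-learn. The synthetic data from `make_classification` stands in for the five account-level features, and the hyperparameters shown are defaults rather than the article's actual tuning:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)

# Synthetic stand-in for the five user-level features (tweet count,
# follower count, favorite count, friends count, listed count).
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

models = {
    "CART Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(C=1.0),  # L2-regularized by default
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Precision, recall, accuracy and the confusion matrix per model.
    print(name,
          "acc=%.2f" % accuracy_score(y_test, pred),
          "prec=%.2f" % precision_score(y_test, pred),
          "rec=%.2f" % recall_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
```

On real account data the scores would differ from the synthetic run, but the evaluation loop stays the same.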

CART Decision Tree

Decision Tree Classifier with an accuracy of approx. 84%

Random Forest Classifier

Random Forest Classifier with an accuracy of approx. 82%

Logistic Regression

Logistic Regression(Regularized) with an accuracy of approx. 70%

Tweet Based Classification

This classifier used the Naive Bayes algorithm and classified accounts on the basis of their tweets. First, the tweet data was cleaned by removing all URLs, stop-words and punctuation symbols. The data was then arranged into the input format of the nltk NaiveBayesClassifier. The Naive Bayes classifier builds probabilistic models, based on Bayes' theorem, of words occurring in the different classes, and predicts how likely a tweet is to come from a bot based on the terms the tweet uses. Being a Naive Bayes model, it ignores the order of the words and considers only which words appear in the tweet.

Text Cleaning

Text Cleaning & Pre-processing
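The cleaning steps described above (stripping URLs, punctuation and stop-words) can be sketched like this. The small inline stop-word set is a stand-in for `nltk.corpus.stopwords`, used here so the snippet needs no download step:

```python
import re
import string

# Minimal stop-word set; a real pipeline would use nltk's English
# stopwords list instead.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "rt"}

def clean_tweet(text):
    text = text.lower()
    # Strip URLs (http/https links and bare www. links).
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # Strip punctuation symbols.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize and drop stop-words.
    return [w for w in text.split() if w not in STOP_WORDS]

print(clean_tweet("RT Check this out: https://t.co/abc123 the BEST bot ever!"))
```

The resulting token lists are what get converted into nltk feature dictionaries in the next step.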

Naive Bayes Classifier

Naive Bayes Classifier with an accuracy of approx. 72%
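A minimal sketch of the nltk `NaiveBayesClassifier` workflow. The toy labelled tweets below are invented for illustration; real training data would come from the cleaned corpus:

```python
from nltk.classify import NaiveBayesClassifier

# nltk expects a list of (featureset_dict, label) pairs. These toy
# examples are hypothetical, not the article's dataset.
train = [
    ({"free": True, "follow": True, "win": True}, "bot"),
    ({"crypto": True, "giveaway": True, "click": True}, "bot"),
    ({"coffee": True, "morning": True, "friends": True}, "human"),
    ({"movie": True, "weekend": True, "great": True}, "human"),
]

clf = NaiveBayesClassifier.train(train)

# Bag-of-words features: word order is ignored, only presence counts,
# matching the Naive Bayes assumption described above.
def features(words):
    return {w: True for w in words}

print(clf.classify(features(["win", "free", "giveaway"])))
```

The same `features` function is applied to each cleaned tweet before training and before classification.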

Implications

In its current minimum-viable form, this project demonstrates that it is possible to detect whether tweets are from bots, and whether users are bots, with basic machine-learning algorithms, achieving fairly high accuracy with a relatively simple design. The models do not use large feature sets and are simplistic, untuned models that can be improved later. The models pick up too many false positives, so more work is definitely needed to make this a production application.

YouTube Video

YouTube video about the article

Thanks for reading. Find me here.
