First, we crawled a near-complete dataset from Twitter, containing more than 54 million users, 1.9 billion links, and almost 1.8 billion tweets. Second, we created a labeled collection with users “manually” classified as spammers and non-spammers. Third, we conducted a study about the characteristics of tweet content and user behavior on Twitter aiming at understanding their relative discriminative power to distinguish spammers and non-spammers.
Lastly, we investigate the feasibility of applying a supervised machine learning method to identify spammers.
We also investigate different tradeoffs for our classification approach, namely the attribute importance and the use of different attribute sets.
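To make the idea of comparing attribute sets concrete, the following is a minimal sketch, not the classifier used in this work. It swaps in a simple nearest-centroid classifier for illustration, and the attribute names (`url_fraction`, `hashtags_per_tweet`, `followers_per_followee`), the toy values, and the helper functions are all hypothetical.

```python
from statistics import mean

# Hypothetical attribute sets; the names are assumptions for illustration only.
CONTENT_ATTRS = ["url_fraction", "hashtags_per_tweet"]
BEHAVIOR_ATTRS = ["followers_per_followee"]

# Toy labeled users (made-up values, not taken from the dataset).
TRAIN = [
    ({"url_fraction": 0.9, "hashtags_per_tweet": 2.0, "followers_per_followee": 0.1}, "spammer"),
    ({"url_fraction": 0.8, "hashtags_per_tweet": 1.5, "followers_per_followee": 0.2}, "spammer"),
    ({"url_fraction": 0.2, "hashtags_per_tweet": 0.5, "followers_per_followee": 1.0}, "non-spammer"),
    ({"url_fraction": 0.1, "hashtags_per_tweet": 0.3, "followers_per_followee": 1.5}, "non-spammer"),
]
TEST = [
    ({"url_fraction": 0.85, "hashtags_per_tweet": 1.8, "followers_per_followee": 0.15}, "spammer"),
    ({"url_fraction": 0.15, "hashtags_per_tweet": 0.4, "followers_per_followee": 1.2}, "non-spammer"),
]


def train(samples, attrs):
    """Compute one centroid per class over the chosen attributes."""
    rows_by_label = {}
    for features, label in samples:
        rows_by_label.setdefault(label, []).append([features[a] for a in attrs])
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in rows_by_label.items()
    }


def predict(model, features, attrs):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    point = [features[a] for a in attrs]

    def squared_distance(centroid):
        return sum((p - c) ** 2 for p, c in zip(point, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


def accuracy(train_set, test_set, attrs):
    """Fraction of test users classified correctly using only `attrs`."""
    model = train(train_set, attrs)
    hits = sum(predict(model, f, attrs) == label for f, label in test_set)
    return hits / len(test_set)


# Compare the attribute sets on the same train/test split.
for name, attrs in [("content", CONTENT_ATTRS),
                    ("behavior", BEHAVIOR_ATTRS),
                    ("all", CONTENT_ATTRS + BEHAVIOR_ATTRS)]:
    print(name, accuracy(TRAIN, TEST, attrs))
```

Running the same evaluation on each attribute subset is one simple way to see how much each group of attributes contributes. On real data the split would be replaced by cross-validation over the labeled collection.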
DATASET AND LABELED COLLECTION
We need a labeled collection of users, pre-classified into spammers and non-spammers. To the best of our knowledge, no such collection is publicly available, so we had to build one.
Crawling Twitter
To that end, we asked Twitter to allow us to collect such data, and they white-listed 58 servers located at the Max Planck Institute for Software Systems (MPI-SWS) in Germany.
We plan to make this data available to the wider community. For a detailed description of this dataset, we refer the reader to our project homepage [3].
http://twitter.mpi-sws.org.
Building a labeled collection
In order to meet these three desired properties, we focus on users that post tweets about three trending topics widely discussed in 2009: (1) Michael Jackson’s death, (2) Susan Boyle’s emergence, and (3) the hashtag “#musicmonday”.
Table 1 summarizes statistics about the number of tweets we have in our dataset as well as the number of unique users that spread these tweets.
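As an illustration of how such per-topic statistics could be computed, the sketch below counts tweets and unique users for each topic. The keyword patterns, the `(user_id, text)` tweet format, and the function name `summarize_topics` are assumptions made for this example; how tweets were actually matched to topics is not specified here.

```python
import re
from collections import defaultdict

# Hypothetical keyword patterns, one per trending topic (an assumption,
# not the matching rule used to build the dataset).
TOPIC_PATTERNS = {
    "Michael Jackson": re.compile(r"michael jackson", re.IGNORECASE),
    "Susan Boyle": re.compile(r"susan boyle", re.IGNORECASE),
    "#musicmonday": re.compile(r"#musicmonday", re.IGNORECASE),
}


def summarize_topics(tweets):
    """Return {topic: (tweet_count, unique_user_count)}.

    `tweets` is an iterable of (user_id, text) pairs.
    """
    tweet_counts = defaultdict(int)
    users = defaultdict(set)
    for user_id, text in tweets:
        for topic, pattern in TOPIC_PATTERNS.items():
            if pattern.search(text):
                tweet_counts[topic] += 1
                users[topic].add(user_id)
    return {
        topic: (tweet_counts[topic], len(users[topic]))
        for topic in TOPIC_PATTERNS
    }


# Toy input (made-up tweets, for illustration only).
SAMPLE = [
    (1, "RIP Michael Jackson"),
    (2, "Michael Jackson tribute tonight"),
    (1, "Still listening to michael jackson"),
    (3, "Susan Boyle was amazing"),
    (2, "#MusicMonday picks"),
    (4, "Nothing to see here"),
]

summary = summarize_topics(SAMPLE)
for topic, (n_tweets, n_users) in summary.items():
    print(f"{topic}: {n_tweets} tweets, {n_users} unique users")
```

Keeping a set of user IDs per topic makes the unique-user count robust to users who post several tweets on the same topic.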
By choosing