When the data from the Forum of the Five Star Movement (M5S) was originally collected, the website www.beppegrillo.it had no index listing all the pages in the domain. To collect the content of the Forum, I first indexed all pages in the directory www.beppegrillo.it/listeciviche/forum/. Indexing was performed by combining the results of queries submitted via the APIs of Google Custom Search Engine and Disqus. Once (arguably) all URLs of the Forum's pages were indexed, I requested each page, parsed its content and stored the relevant information (e.g. username, proposal title and body text, proposal date of publication, comments) in a relational database. The data was subsequently processed and cleaned.
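The indexing step can be illustrated with a minimal sketch. It assumes that each API has already returned a plain list of URLs and that two URLs refer to the same page when host and path coincide. The function names and the normalisation rule are illustrative assumptions, not the code originally used.

```python
from urllib.parse import urlsplit

# Directory named in the text; only pages under it belong to the Forum.
FORUM_PREFIX = "www.beppegrillo.it/listeciviche/forum/"


def normalize(url):
    """Reduce a URL to host + path (assumption: scheme, query string and
    fragment do not distinguish Forum pages)."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    return parts.netloc.lower() + parts.path


def merge_indexes(*sources):
    """Return the sorted union of the URLs found by every source,
    restricted to the Forum directory."""
    merged = set()
    for urls in sources:
        for url in urls:
            key = normalize(url)
            if key.startswith(FORUM_PREFIX):
                merged.add(key)
    return sorted(merged)
```

Calling `merge_indexes(google_urls, disqus_urls)` with two hypothetical result lists yields one deduplicated index in which a page found by both services appears only once.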
After the pages of the Forum were indexed, a web crawler requested every page, one every 30 seconds, for a maximum of 3000 requests per day. The crawler ran between and downloading a total of pages. After each page was downloaded, the information presented on it (about the author of the proposal, the proposal, the authors of comments and the comments) was parsed and stored in a relational database. For pages published before the adoption of Disqus as the Forum's commenting system in mid-2012, parsing was guided by the HTML tags embedded in the page, which structure the parts of the document. For more recent pages, only the body text and the publication date of the proposal were parsed directly from the HTML page; the remaining information was parsed from the JSON document returned in response to a request to the Disqus API.
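The pacing described above can be sketched as follows. The 30-second delay and the daily cap of 3000 requests come from the text; the `crawl` function, its parameters and the injected `fetch` callable are hypothetical and stand in for whatever HTTP client was actually used.

```python
import time

DELAY_SECONDS = 30   # one request every 30 seconds (from the text)
DAILY_LIMIT = 3000   # at most 3000 requests per day (from the text)


def crawl(urls, fetch, sleep=time.sleep,
          delay=DELAY_SECONDS, daily_limit=DAILY_LIMIT):
    """Fetch at most `daily_limit` URLs, pausing `delay` seconds between
    consecutive requests. Returns the downloaded pages and the URLs left
    over for the next day's run."""
    urls = list(urls)
    batch, remaining = urls[:daily_limit], urls[daily_limit:]
    pages = {}
    for i, url in enumerate(batch):
        if i > 0:
            sleep(delay)  # no pause is needed before the first request
        pages[url] = fetch(url)
    return pages, remaining
```

Passing `fetch` and `sleep` as arguments keeps the sketch independent of any particular HTTP library and makes the pacing easy to check without waiting. Note that 86,400 seconds per day divided by 30 gives 2,880 requests, so under these settings the delay alone already keeps a single run below the daily cap.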
The following table shows the raw data gathered until and published between and on the Forum:
|Proposals||Comments||Unique users||Non-unique users|
The Forum appears to contain several duplicates of the same discussion threads. For example, the same proposal by the same author might appear on different webpages as ...proposal.html, ...proposal-1.html, ...proposal-6.html. In total I removed duplicated proposals not linked to any comment.
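One way to detect such duplicates is sketched below. It assumes that copies differ only by a numeric suffix before `.html`, as in the example above, and that each proposal record carries its URL, author and number of linked comments. The record layout and function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches a trailing "-<digits>" right before ".html". Assumption: such a
# suffix marks a copy; a title that legitimately ends in a number would
# also match, so the results call for a manual check.
SUFFIX = re.compile(r"-\d+(?=\.html$)")


def canonical(url):
    """Strip the numeric copy suffix from a proposal URL."""
    return SUFFIX.sub("", url)


def removable_duplicates(proposals):
    """Return the URLs of duplicated proposals that have no comments,
    always keeping one copy (the most commented) per group."""
    groups = defaultdict(list)
    for p in proposals:
        groups[(p["author"], canonical(p["url"]))].append(p)
    removable = []
    for copies in groups.values():
        if len(copies) < 2:
            continue
        keeper = max(copies, key=lambda p: p["n_comments"])
        removable += [p["url"] for p in copies
                      if p is not keeper and p["n_comments"] == 0]
    return removable
```

Grouping by author as well as by canonical URL reflects the condition in the text that the duplicate is the same proposal by the same author.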
From the raw data I attempted to identify suspected spam comments and the users who posted them. This task was significantly simplified by the fact that the majority of spam comments were posted in English. Once identified, both the comments and the users were removed. The following table displays the number of comments and users filtered out as spam:
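Because most spam was in English while the Forum is in Italian, a crude language check is enough to produce a list of candidates. The sketch below counts common function words in each language; the word lists, the threshold and the function names are illustrative assumptions, and the text does not specify how the language was actually determined.

```python
import re

# Small sets of frequent function words, chosen to avoid overlap between
# the two languages (illustrative, not exhaustive).
ENGLISH = {"the", "and", "of", "to", "is", "for", "you",
           "with", "that", "this", "your", "are"}
ITALIAN = {"il", "la", "di", "che", "per", "non", "un", "una",
           "sono", "con", "del", "della", "è", "le"}


def looks_english(text):
    """Heuristic: more English than Italian function words, at least two."""
    words = re.findall(r"[a-zàèéìòù]+", text.lower())
    english = sum(w in ENGLISH for w in words)
    italian = sum(w in ITALIAN for w in words)
    return english >= 2 and english > italian


def flag_spam(comments):
    """comments: iterable of (comment_id, username, text) tuples.
    Returns the ids of suspected spam comments and their authors."""
    spam_ids, spam_users = [], set()
    for comment_id, username, text in comments:
        if looks_english(text):
            spam_ids.append(comment_id)
            spam_users.add(username)
    return spam_ids, spam_users
```

The output is best treated as a list of suspects for review rather than a final verdict, since a legitimate comment quoting an English source would also be flagged.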
|Spam comments||Spam users|
In order to track the behaviour and actions of the same person over time, whether posting a comment or a proposal, I attempted to identify suspects in the two user sets who could reasonably be assumed to be the same individual. As a rule, I excluded all users with a single-word username or with a digit in the username (e.g. "Paul John 2"). Once suspects were identified, all their comments were delinked from them and linked to the corresponding user in the unique set. Then duplicates were eliminated from the non-unique set. Likewise, when a suspect with multiple usernames was identified, all comments were delinked and then linked to a single unique user, and the duplicates were eliminated.
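The exclusion rule and the relinking step can be sketched as follows. The rule itself (no single-word usernames, no digits) comes from the text; matching on a case- and whitespace-insensitive comparison of usernames is an assumption, as are the function names and the dictionary layout.

```python
def is_eligible(username):
    """Exclusion rule from the text: skip single-word usernames and
    usernames containing a digit (e.g. "Paul John 2")."""
    return (len(username.split()) >= 2
            and not any(ch.isdigit() for ch in username))


def normalized(username):
    """Assumed matching key: lower case, single spaces."""
    return " ".join(username.lower().split())


def match_users(unique, non_unique):
    """unique, non_unique: dicts mapping user id -> username.
    Returns a mapping from each non-unique user id to the id of the
    unique user assumed to be the same individual."""
    by_name = {}
    for user_id, name in unique.items():
        if is_eligible(name):
            by_name.setdefault(normalized(name), user_id)
    return {
        user_id: by_name[normalized(name)]
        for user_id, name in non_unique.items()
        if is_eligible(name) and normalized(name) in by_name
    }
```

The resulting mapping indicates which comments to relink: every comment attached to a key of the mapping would be reassigned to the corresponding value, after which the matched entries can be dropped from the non-unique set.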