## What we can learn from the M5S

I am reading and responding to Massimo Mantellini's post (Il M5S, il wifi e il principio di precauzione), which notes with concern that the Movement has brought anti-scientific positions into Parliament, thereby in some way legitimising them; a “toxic, banal and in its own way unassailable way of thinking, which harms the entire country”.

The Five Star Movement, with an electoral base of between 25 and 30% (8.5-10 million people), is necessarily complex in terms of demographic representation and diversity of opinion. Assuming an abstention rate of 25%, if you are queuing at the supermarket, about two of the 10 people ahead of you vote M5S. Unfortunately this complexity rarely comes through in journalistic accounts, and those who report the news tend (too) often to prefer caricature (tinfoil hats or trips to North Korea, so to speak). But this kind of reporting is wrong: first because it distorts by simplifying, second because it encourages clownish, grotesque and boorish behaviour from those who, sitting in crowded institutions, seek visibility.

A simplification that I consider more accurate, and that is based on opinion polls of the Five Star electorate, is that the Movement responds effectively to two widespread feelings: anger and anxiety. The desire for direct participation through the Internet and the meetups is unlikely to mobilise more than 1% of the Five Star electorate (about 90,000 people are registered with the meetups). The causes of anger and anxiety are social, cultural and economic. Anxiety is the child of a trend towards precariousness in many aspects of life, not only economic but also social and cultural. Anger arises from perceiving one's own situation as unjust compared with those who have increased (unduly, it is argued) their own well-being: the corrupt, but also those privileged for generational reasons or by socio-economic status. The M5S created neither the anger nor the anxiety. Nor did it create the stories of chemtrails, the faked moon landing or the vaccine-autism link. Pointing to them as plague-spreaders is intellectually dishonest; delegitimising them smacks of elitism. The Movement has simply offered a platform for grievances into which has poured a malaise that found no hearing elsewhere and that the Movement, by choice or by necessity, does not seem to filter in any way.

It is understandable to want to denounce opinions that are potentially dangerous (“vaccines cause autism”) or crackpot (“earthquakes can be predicted”), but it is pointless until the causes of anger and anxiety are treated. And I would expect to see more media attention on these causes. A personal list of causes begins with the breakdown (or severe weakening) of the relationship of trust with three cornerstone institutions of Western democratic societies: scientific institutions, economic institutions (I am thinking of banks and the productive fabric) and representative institutions. And of course the media, which connect these with citizens. The general perception of these institutions has been overturned: from guarantors of well-being and prosperity to a den of faceless and unscrupulous profiteers. And it is not the fault of the conspiracy pages on Facebook. Those pages attract support because the story they tell is perfectly compatible, as well as entertaining, with the view that many (too many) have of the world around them. It is the fault of the fact that many feel excluded from the well-being they believe they are owed, and point the finger at those who create well-being but do not distribute it fairly. But it is also the fault of those who, like Mantellini in his post, by despising and excluding (exactly the opposite of what the M5S has shown it can do, whether we like it or not) contribute on the one hand to increasing the sense of marginalisation and anger and on the other to fragmenting the debate into reassuring homogeneous microspheres. The advice I would give Mantellini is the same I would give to someone who is convinced that vaccines are harmful: put yourself in the intellectual position to understand those who have reached different conclusions.

Friday, 22 July 2016

## Road to Rome: The organisational and political success of the M5S

The Five Star Movement (M5S) obtained two major victories in the second round of municipal elections on 19 June 2016 in Rome and Turin. Rome attracted the most international attention but it is M5S’ victory in Turin that is likely the most consequential for them and other European anti-establishment parties.

In Rome, a municipality of 2.8 million people with an annual budget of €5 billion, Virginia Raggi (age 37) won double the votes of her opponent Roberto Giachetti (age 55). In Turin, a city with a population of 900,000 and an annual budget of €1.69 billion, Chiara Appendino (age 31) outstripped Piero Fassino (age 66) by about 10 percentage points.

Continue reading on Pop Politics Aus

Friday, 8 July 2016

## Explicit semantic analysis with R

Explicit semantic analysis (ESA) was proposed by Gabrilovich and Markovitch (2007) to compute a document's position in a high-dimensional concept space. At its core, the technique compares the terms of the input document with the terms of the documents describing the concepts, estimating the relatedness of the input document to each concept. In spatial terms, if I know the relative distance of the input document from meaningful concepts (e.g. ‘car’, ‘Leonardo da Vinci’, ‘poverty’, ‘electricity’), I can infer the meaning of the document relative to explicitly defined concepts from the document’s position in the concept space.

Wikipedia provides the concept space. Each article is a concept: thus en.wikipedia.org/wiki/Car stands for ‘car’, en.wikipedia.org/wiki/Leonardo_da_Vinci for ‘Leonardo da Vinci’, en.wikipedia.org/wiki/Poverty for ‘poverty’ and en.wikipedia.org/wiki/Electricity for ‘electricity’.

For each input document $$D$$, the analysis results in a vector of weights of length $$N$$, where $$N$$ is the number of concepts in the concept space. Each weight is a scalar describing the strength of the association between the document $$D$$ and a concept $$c_j$$, for $$c_j \in \{c_1, \ldots, c_N\}$$.
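To make the notation concrete, here is a minimal sketch in base R. The four terms, the three concepts and all the weights are made up for illustration; nothing here is taken from Wikipedia, and the real weights are computed in Step 5.

```r
# Toy concept space: rows are terms, columns are concepts (made-up weights)
concepts <- matrix(c(0.9, 0.0, 0.1,
                     0.8, 0.1, 0.0,
                     0.0, 0.7, 0.2,
                     0.0, 0.1, 0.9),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(c("engine", "wheel", "income", "voltage"),
                                   c("car", "poverty", "electricity")))

# Input document D, described by the (made-up) weights of its terms
D <- c(engine = 0.5, wheel = 0.5, income = 0.0, voltage = 0.1)

# One weight per concept: the strength of the association between D and each c_j
esa_vector <- drop(crossprod(concepts, D))
print(esa_vector)
```

The resulting vector has length N = 3, and its largest entry identifies the concept the toy document is most strongly associated with.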

## Step 1: Initialise a MySQL database to store data from Wikipedia

The Wikipedia content that we will download needs to be stored in a relational database. MySQL is the main back-end database management system for Wikipedia and is therefore a natural choice to store the content from the site. Depending on the language version, the database will take up a few gigabytes of disk space. It is therefore necessary to make sure that there is enough storage available in the MySQL data directory (you can refer to this answer to change your MySQL data directory). Assuming that MySQL is already installed and running on our machine, we can create a database called wikipedia from the shell with

echo "CREATE DATABASE wikipedia" | mysql -u username -p

and then run this script (Jarosciak 2013) to initialise the different tables with

mysql -u username -p wikipedia < initialise_wikipedia_database.sql

This will create an empty database populated by all the tables required to store a Wikipedia site (or, more precisely, all the tables required by MediaWiki, the open source software powering all the sites of the Wikimedia galaxy). For a description of the tables see here and for the diagram of the database schema here. Once the database is set up we can proceed to download and import the data.

## Step 2: Download Wikipedia’s data dumps

Wikimedia, the not-for-profit foundation running Wikipedia, conveniently provides data dumps of the different language versions of Wikipedia at dumps.wikimedia.org/backup-index.html. Since the data dumps can reach considerable sizes, the repository provides different data packages.

Here I will use data dumps of the Italian version of Wikipedia (to download the English version simply replace it with en) as of February 3, 2016 (20160203).

In order to build our concept map we will start by downloading the corpus of all the pages of the Italian version of Wikipedia, which are contained in the file itwiki-20160203-pages-articles.xml. Once imported into the wikipedia database, the XML file will fill three tables: the text table, where the body of the articles is actually contained, the page table, with the metadata for all pages, and the revision table, which allows us to associate each text with a page through the SQL relations revision.rev_text_id=text.old_id and revision.rev_page=page.page_id. Additionally we will need to download two other data dumps: itwiki-20160203-pagelinks.sql and itwiki-20160203-categorylinks.sql, which will fill the pagelinks table and the categorylinks table respectively. The pagelinks table details all internal links connecting different pages (we will use this information to estimate the network relevance of each page), and the categorylinks table describes the categories each page belongs to.
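Since the three file names follow the same pattern, they can be generated for any language edition and dump date. The helper dumpFiles() below is hypothetical (it is not part of any package); it only rebuilds the file names mentioned above and does not download anything.

```r
# Hypothetical helper: build the names of the three dump files used in this post
# lang is the language code (e.g. "it" or "en"), date the dump date as YYYYMMDD
dumpFiles <- function(lang, date) {
  prefix <- paste0(lang, "wiki-", date, "-")
  paste0(prefix, c("pages-articles.xml", "pagelinks.sql", "categorylinks.sql"))
}

print(dumpFiles("it", "20160203"))
print(dumpFiles("en", "20160203"))
```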

## Step 3: Import data dumps into the MySQL database

The three data dumps that we downloaded are in two different formats: SQL and XML. The SQL files are ready to be imported into the database with the following commands

mysql -u username -p wikipedia < itwiki-20160203-pagelinks.sql
mysql -u username -p wikipedia < itwiki-20160203-categorylinks.sql

The XML file instead requires some preprocessing. There are different tools that can assist you in importing a Wikipedia XML data dump into a relational database. For convenience we use MWDumper. The following command will pipe the content of the XML file directly into the wikipedia database:

java -jar mwdumper.jar --format=sql:1.5 itwiki-20160203-pages-articles.xml | mysql -u username -p wikipedia

MWDumper can also automatically take care of compressed dump files (such as itwiki-20160203-pages-articles.xml.bz2). Importing the XML data dump can take a few hours. If the import is launched on a remote machine, it is therefore advisable to use a terminal multiplexer such as tmux to avoid problems if the connection with the remote machine is interrupted.

We then have a MySQL database wikipedia with information on all current pages, their internal links and their categories.

## Step 4: Mapping categories (optional)

In my case I was interested in limiting my concept map to a few thousand concepts instead of the hundreds of thousands of concepts (or more than 5 million in the case of the English version of Wikipedia) that would have resulted from processing all the articles contained in the data dump. I therefore created a hierarchical map of categories, constructed as a directed network with categories as nodes and links defined by the relation subcategory of. In the map I targeted specific neighbourhoods, each defined by a node of interest and its immediate neighbours, and filtered out all articles not belonging to these categories. Reducing the number of articles used to construct the concept space has both a practical purpose (to bring the computing required for the analysis down to a more accessible level) and a theoretical purpose, which must find a justification in your analysis (in my case I was interested in reading an online conversation in terms of only a few categories of interest).

In R we first need to establish a connection with the MySQL database. For convenience we create a function to pull the entire content of a table from a MySQL database into a data.frame:

getTable <- function(con, table) {
  require(DBI)
  # Send the query, fetch all rows (n = -1) and free the result set
  query <- dbSendQuery(con, paste("SELECT * FROM ", table, ";", sep=""))
  result <- fetch(query, n = -1)
  dbClearResult(query)
  return(result)
}

Then we open the connection with

pw <- "yourpassword"
require(RMySQL)
con <- dbConnect(RMySQL::MySQL(), dbname = "wikipedia", username = "root", password = pw)

and load the page and categorylinks tables as data.tables

require(data.table)
page <- data.table(getTable(con, "page"))
categorylinks <- data.table(getTable(con, "categorylinks"))

MediaWiki defines a namespace for each wiki page to indicate the purpose of the page. Pages describing categories, which are the ones of interest now, are indicated by the namespace 14. We then create a data.table containing only category pages with

page_cat <- page[page_namespace == 14,]

Now we need to construct a network describing the hierarchy of categories and the relations between articles and categories (which categories an article belongs to). Each page has category links, described in the categorylinks table, pointing to its parent categories. Importantly, each category page can have one or more parent categories, which means that the topology of the network mapping the hierarchy of categories will not be a tree. In the categorylinks table the field cl_from stores the page id (the same we find in the field page_id of the page table), while cl_to stores the name of the parent category, which does not necessarily correspond to an actual page listed in the table page. In order to unambiguously map the relations between categories we need to join (or merge()) the data.tables so as to build a directed edgelist with a page id for each endpoint.
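A toy version of this join, written in base R with made-up pages, shows what the merge achieves: the category name in cl_to is resolved to a page id, so that both endpoints of each edge are identified by an id. The four pages and three links below are invented for illustration.

```r
# Made-up miniature versions of the page and categorylinks tables
page <- data.frame(page_id = c(1, 2, 3, 4),
                   page_namespace = c(0, 14, 14, 14),
                   page_title = c("Food_bank", "Poverty", "Social_issues", "Economic_problems"),
                   stringsAsFactors = FALSE)
categorylinks <- data.frame(cl_from = c(1, 2, 2),
                            cl_to = c("Poverty", "Social_issues", "Economic_problems"),
                            stringsAsFactors = FALSE)

# Keep only category pages and resolve each parent category name to its page id
page_cat <- page[page$page_namespace == 14, c("page_id", "page_title")]
names(page_cat) <- c("cl_to_id", "cl_to")
edgelist <- merge(categorylinks, page_cat, by = "cl_to")[, c("cl_from", "cl_to_id")]
print(edgelist)
```

In this toy example the category page with id 2 has two parent categories, which is exactly why the resulting network is not a tree.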

First we need to rename a few of the columns

require(plyr)
page_cat <- plyr::rename(page_cat,
                         c("page_title" = "cl_to",
                           "page_id" = "cl_to_id",
                           "page_namespace" = "cl_to_namespace"))

Encoding(page_cat$cl_to) <- 'latin1' # This might fix encoding issues
page_cat <- page_cat[,cl_to:=tolower(cl_to)]
# Lower-case the category names on the other side of the join as well
categorylinks <- categorylinks[,cl_to:=tolower(cl_to)]
setkey(page_cat, 'cl_to')
setkey(categorylinks, 'cl_to')

and then we merge the data.tables

# Merge on cl_to (to are categories)
categorylinks <- merge(categorylinks, page_cat[,.(cl_to, cl_to_id, cl_to_namespace)])

# Merge on cl_from
page_cat <- page
page_cat <- plyr::rename(page_cat, c("page_id" = "cl_from_id", "page_namespace" = "cl_from_namespace"))
# categorylinks stores the id of the linking page in cl_from
setnames(categorylinks, "cl_from", "cl_from_id")
setkey(page_cat, 'cl_from_id')
setkey(categorylinks, 'cl_from_id')

# This will remove all links from non-content pages (users, talks, etc)
categorylinks <- merge(categorylinks, page_cat[,.(cl_from_id, page_title, cl_from_namespace)])

Once we have a data.table describing each edge of the category network we can create two data.tables: a two-column data.table as edgelist and a data.table describing each vertex (category or article) of the network, with the name of the page (page_title) and its namespace as attributes.

edgelist_categorylinks <- categorylinks[,.(cl_from_id, cl_to_id)]
vertices_categorylinks <- data.table(page_id = c(categorylinks$cl_from_id,
                                                 categorylinks$cl_to_id),
                                     namespace = c(categorylinks$cl_from_namespace,
                                                   categorylinks$cl_to_namespace))
setkey(vertices_categorylinks, 'page_id')
setkey(page, 'page_id')
Encoding(page$page_title) <- 'latin1'
# Attach the page title to each vertex
vertices_categorylinks <- merge(vertices_categorylinks, page[,.(page_id, page_title)])
vertices_categorylinks <- unique(vertices_categorylinks)

We are now ready to create an igraph object and then to drop all nodes (pages) that are not content articles (namespace == 0) or category description pages (namespace == 14):

require(igraph)
g <- graph.data.frame(edgelist_categorylinks, directed = TRUE, vertices = vertices_categorylinks)
g_ns0_and_ns14 <- g - V(g)[!(V(g)$namespace %in% c(0, 14))]

This results in a directed network. By construction, since every link describes the relation between a page and a parent category, article pages have no incoming links, while category pages might have both incoming links from subcategories and outgoing links to parent categories. In the plot accompanying this step, to keep it simple, only five nodes have two outgoing links; in the full network this is the norm, since most article and category pages have more than one parent category.

If we subset this network by removing every article page we obtain a network describing the hierarchy among Wikipedia categories.

g_ns14 <- g_ns0_and_ns14 - V(g_ns0_and_ns14)[!(V(g_ns0_and_ns14)$namespace %in% c(14))]

Still, many categories will not be of any interest in defining the content of the articles, since they are used by Wikipedia contributors to maintain the site, and should be removed. Each Wikipedia version has a different set of ‘service’ categories (e.g. Articles_needing_expert_attention), so it is impossible to define general rules on how to remove them.

After removing all categories not related to the content of the articles we can target specific macrocategories and fetch all of their neighbours to create a list of categories of interest. All article pages linked to these categories of interest will be selected for creating the concept map. The advantage of this approach is that even if we know the general categories we are interested in (e.g. British_pop_music_groups), we do not necessarily know all the categories in their neighbourhood (in this example, British_boy_bands and British_pop_musicians).

To add categories to a list of categories of interest we proceed as follows: we subset g_ns14 by creating an ego-network around a general category, we fetch all categories contained in the resulting ego-network and we store them in a data.frame defined as

cat_to_include <- data.frame(page_id = character(),
                             page_title = character(),
                             target_cat = character())

First we create a function that, given a graph g and our data.frame cat_to_include, will add to the data.frame all the category names contained in g

addToCat <- function(g, cat_to_include, target_cat) {
  cat_to_include <- rbind(cat_to_include,
                          data.frame(page_id = V(g)$name,
                                     page_title = V(g)$page_title,
                                     target_cat = target_cat,
                                     stringsAsFactors = FALSE))
  return(cat_to_include)
}

Then for each general category we know, let’s say Poverty, we construct an ego-graph with make_ego_graph() and fetch the categories it contains.

g_of_interest <-
  make_ego_graph(g_ns14, nodes = V(g_ns14)[V(g_ns14)$page_title == 'Poverty'], order = 2, mode = 'all')[[1]]
cat_to_include <- addToCat(g_of_interest, cat_to_include, 'Poverty')

The argument order controls the radius of the neighbourhood; order = 2 indicates that we want to include in our ego-graph every node within a distance of two steps from the ego-vertex.

Finally, once we have a comprehensive list of categories of interest we want to fetch all articles that belong to these categories. We go back to our graph describing relations among categories and between categories and articles, g_ns0_and_ns14, and we create from it a new graph called selected_articles which includes only the pages of interest. First, we drop from g_ns0_and_ns14 all category pages (V(g_ns0_and_ns14)$namespace == 14) that are not listed in cat_to_include (note that g_ns0_and_ns14 uses the page id as the vertex attribute name) and then calculate for each vertex the number of outgoing links.

selected_articles <- g_ns0_and_ns14 -
  V(g_ns0_and_ns14)[V(g_ns0_and_ns14)$namespace == 14 & !(V(g_ns0_and_ns14)$name %in% cat_to_include$page_id)]
V(selected_articles)$outdegree <- degree(selected_articles, V(selected_articles), mode = 'out')

Second, we want to track for each selected article which categories caused it to be included in our list. Of course, since article pages usually belong to multiple categories, it is possible that an article was selected because it is linked to more than one category of interest. For each target category we then create a logical vector storing information on whether an article belongs to it. We do it with

for (supracat in unique(cat_to_include$target_cat)) {
  cats <- subset(cat_to_include, target_cat == supracat)$page_id

  neighs <-
    graph.neighborhood(selected_articles, V(selected_articles)[V(selected_articles)$name %in% cats], order = 1)
  vertices <- unlist(sapply(neighs, getPageIds))
  selected_articles <- set.vertex.attribute(selected_articles, supracat, V(selected_articles), V(selected_articles)$name %in% vertices)
}

We then convert selected_articles into a data.frame where each row contains the attributes of a node

source('https://raw.githubusercontent.com/fraba/R_cheatsheet/master/network.R')
selected_articles <- subset(selected_articles, namespace == '0')

## Step 5: Concept map

Once we have stored all the necessary Wikipedia tables in the MySQL database we can proceed to build our concept map.

We first create a connection with the MySQL database

pw <- "yourpassword"
require(RMySQL)
con <- dbConnect(RMySQL::MySQL(), dbname = "wikipedia", username = "root", password = pw)

then we load into the R environment the three tables we require as data.tables using the helper function getTable

getTable <- function(con, table) {
  require(DBI)
  require(RMySQL)
  # Send the query, fetch all rows (n = -1) and free the result set
  query <- dbSendQuery(con, paste("SELECT * FROM ", table, ";", sep=""))
  result <- fetch(query, n = -1)
  dbClearResult(query)
  return(result)
}

require(data.table)
text <- data.table(getTable(con, "text"))
revision <- data.table(getTable(con, "revision"))
page <- data.table(getTable(con, "page"))
dbDisconnect(con)

The table revision is a join table that we need to relate the table text, which contains the actual text of the Wikipedia pages, and the table page which instead contains the name of the page (that is, the title). The relations connecting the three tables are defined as

revision.rev_text_id = text.old_id
revision.rev_page = page.page_id

The goal now is to create a table wiki_pages containing all the information of interest for each Wikipedia page. We do first

setnames(text,"old_id","rev_text_id")
setkey(text, rev_text_id)
setkey(revision, rev_text_id)
wiki_pages <- merge(revision[,.(rev_text_id, rev_page)], text[,.(rev_text_id, old_text)])

and then

setnames(wiki_pages,"rev_page","page_id")
setkey(wiki_pages, "page_id")
setkey(page, "page_id")
wiki_pages <- merge(page[,.(page_id, page_namespace, page_title, page_is_redirect)],
                    wiki_pages[,.(page_id, old_text)])
Encoding(wiki_pages$page_title) <- 'latin1'

wiki_pages now contains page_id, page_title, page_namespace (for details on namespaces used by Wikipedia see this), page_is_redirect (a Boolean field indicating whether the page is actually a simple redirect to another page), and old_text, where the text of the article is actually stored.

We then create a data.table named redirects (see here for an explanation of redirect pages) with two fields indicating the title of the page that is redirected (from) and the destination of the redirect link (to). In a redirect page the destination of the redirect link is indicated in the text of the page in square brackets. For example the page UK, redirecting to the page United_Kingdom, contains as the body of the article the text #REDIRECT [[United Kingdom]]. We can then use the regular expression "\\[\\[(.*?)\\]\\]" as argument of the function str_extract() to parse the article title from the article text before storing it in the field to (note that since the regular expression is passed as an R string we need to double escape special characters).

# Extract all redirects
redirects <- wiki_pages[page_is_redirect == 1,]
redirects$from <- redirects$page_title

getPageRedirect <- function(x) {
  require(stringr)
  x <- unlist(str_extract(x, "\\[\\[(.*?)\\]\\]"))
  x <- gsub("\\[|\\]","",x)
  x <- gsub("(\\||#)(.*?)$","", x)
  # No link found in the page text: there is no destination to return
  if (length(x) < 1 || is.na(x[1])) return(NA)
  return(gsub(" ","_",x))
}

redirects$to <- sapply(redirects$old_text, getPageRedirect, USE.NAMES = FALSE)
redirects <- redirects[,.(from, to)]
redirects <- redirects[!is.na(to),]
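The parsing logic can be checked on a few hand-written strings. The sketch below is a base R variant (it uses regmatches() instead of stringr), and the function name parseRedirect() and the three example strings are made up for illustration.

```r
# Base R sketch: parse the destination of a redirect from the wikitext of a page
parseRedirect <- function(text) {
  # First [[...]] link in the text, or character(0) when there is none
  m <- regmatches(text, regexpr("\\[\\[(.*?)\\]\\]", text, perl = TRUE))
  if (length(m) == 0) return(NA_character_)
  target <- gsub("\\[|\\]", "", m)          # drop the square brackets
  target <- sub("(\\||#).*$", "", target)   # drop section anchors and link labels
  gsub(" ", "_", target)                    # page titles use underscores
}

print(parseRedirect("#REDIRECT [[United Kingdom]]"))
print(parseRedirect("#REDIRECT [[United Kingdom#History|UK history]]"))
print(parseRedirect("No link here"))
```

Stripping everything after `#` or `|` matters because a redirect can point to a section of a page or carry a display label, neither of which is part of the page title.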

At this point we can proceed to clean our wiki_pages table by removing everything we do not need in the analysis. These lines

wiki_pages <- wiki_pages[page_namespace == 0,]
wiki_pages <- wiki_pages[page_is_redirect == 0,]
wiki_pages <- wiki_pages[!grepl("^\\{\\{disambigua\\}\\}",old_text),] 

will remove all pages that are not content articles (page_namespace != 0) or that are redirects (page_is_redirect != 0). We also get rid of disambiguation pages (see here), which do not contain any article but simply list other pages.

Preparing the actual text of the articles requires a few steps and follows the traditional guidelines of natural language processing. With the following function we remove all markup contained in the text (usually identified by angle, curly or square brackets), all punctuation, all special characters indicating a new line (\\n) and all digits (\\d+); we replace multiple spaces (\\s+) with a single space and finally we convert all characters to lower case. We then store the processed version of the text in a new variable clean_text.

# Text cleaning
preprocessText <- function (string) {
  # Remove tags, templates and embedded files
  string <- gsub("<.*?>|\\{.*?\\}|\\[\\[File.*?\\]\\]"," ", string)
  string <- gsub("[[:punct:]]+"," ", string)
  string <- gsub("\\n"," ", string)
  string <- gsub("\\d+"," ", string)
  string <- gsub("\\s+"," ",string)
  string <- tolower(string)
  return(string)
}
wiki_pages$clean_text <- preprocessText(wiki_pages$old_text)

A crucial set of decisions is defining which articles to exclude from the analysis. There are two reasons to reduce the number of articles: filtering articles out reduces computation and, more importantly, removes articles that we would not consider the description of a concept. Gabrilovich and Markovitch (2007) propose two filtering rules:

1. Articles with fewer than 100 words, which might be only the draft of an article or in any case do not provide enough information to inform the description of a concept, are excluded.
2. Articles with fewer than 5 incoming or outgoing links to other Wikipedia pages are excluded: fewer than 5 outgoing links might indicate an article in draft form (Wikipedia is always a work in progress), and being linked to by fewer than 5 other articles might indicate that the article is not relevant in network terms.

To these two rules I suggest adding a third:

3. Articles with a word-to-link ratio of less than 15 are excluded. Many Wikipedia pages are in fact lists of Wikipedia pages (e.g. List_of_Presidents_of_the_United_States) and clearly these pages should not be intended as the description of any concept. Although it is not always possible to identify whether a page is a list only by the word-to-link ratio, I found that list pages usually have a relatively higher number of links.

Let’s now apply these three rules. The 100 words threshold is pretty easy to implement

wiki_pages$word_count <- sapply(gregexpr("\\W+", wiki_pages$clean_text), length) + 1
wiki_pages <- wiki_pages[word_count >= 100,]

To calculate the number of outgoing links we can simply count the occurrences of links present in each page. Calculating the number of incoming links instead requires checking all other pages. We first create a helper function that gets all internal links (that is, links to other Wikipedia pages) embedded in the text of a page:

getPageLinks <- function(x) {
  require(stringr)
  x <- unlist(str_extract_all(x, "\\[\\[(.*?)\\]\\]"))
  x <- x[!grepl(":\\s", x)]
  x <- gsub("\\[|\\]","",x)
  x <- gsub("(\\||#)(.*?)$","", x)
  x <- gsub(" ","_",x)
  return(x)
}

We then create a data.table named edgelist with a row for each internal link found in the Wikipedia articles. The line sapply(wiki_pages$old_text, getPageLinks) will return a list of length equal to the number of rows of wiki_pages. After we name the list, we can take advantage of the function stack() to convert the list into a data.frame, which we then convert into a data.table.

edgelist <- sapply(wiki_pages$old_text, getPageLinks)
names(edgelist) <- wiki_pages$page_id
edgelist <- data.table(stack(edgelist))
require(plyr)
edgelist <- plyr::rename(edgelist, c("values" = "from", "ind" = "from_page_id"))
edgelist <- edgelist[,from:=tolower(from)]
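What stack() does to the named list is easier to see on a tiny example. The page ids and link targets below are made up; only base R is used.

```r
# A named list of outgoing links, one element per (made-up) page id
links <- list("10" = c("united_kingdom", "london"),
              "11" = character(0),
              "12" = "uk")

# stack() returns one row per link: `values` is the link, `ind` the list name
edgelist <- stack(links)
names(edgelist) <- c("from", "from_page_id")
print(edgelist)
```

Note that page 11, which has no outgoing links, contributes no rows at all, so pages without links silently disappear from the edgelist.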

We first merge edgelist, representing all internal links, with redirects, since it is possible that a link will point to a redirect page (e.g. to UK instead of United_Kingdom).

# Merge 1
wiki_pages <- wiki_pages[, page_title:=tolower(page_title)]
setkey(wiki_pages, 'page_title')

redirects <- redirects[,from:=tolower(from)]
redirects <- redirects[,to:=tolower(to)]

setkey(edgelist, 'from')
setkey(redirects, 'from')

edgelist_redirect <- merge(edgelist, redirects)
edgelist_redirect <- edgelist_redirect[,from:=NULL]

and then the resulting edgelist_redirect with wiki_pages

edgelist_redirect <- plyr::rename(edgelist_redirect, c("to" = "page_title"))
setkey(edgelist_redirect, "page_title")
edgelist_redirect <- merge(edgelist_redirect, wiki_pages[,.(page_id, page_title)])
edgelist_redirect <- plyr::rename(edgelist_redirect, c("page_id" = "to_page_id"))
edgelist_redirect <- edgelist_redirect[,page_title:=NULL]

We merge edgelist directly with wiki_pages to obtain edgelist_noredirect, which contains the internal links to pages that are not redirect pages.

# Merge 2
edgelist <- plyr::rename(edgelist, c("from" = "page_title"))
setkey(edgelist, 'page_title')
edgelist_noredirect <- merge(edgelist, wiki_pages[,.(page_id,page_title)])
edgelist_noredirect <- plyr::rename(edgelist_noredirect, c("page_id" = "to_page_id"))
edgelist_noredirect <- edgelist_noredirect[,page_title:=NULL]

We now have a complete list of all internal links, either to redirect or non-redirect pages, by rbind(edgelist_noredirect, edgelist_redirect), and we use it to create a directed graph where the nodes are pages and the edges describe the internal links connecting the pages

require(igraph)
g <- graph.data.frame(rbind(edgelist_noredirect, edgelist_redirect))

and for each page we calculate incoming (indegree) and outgoing (outdegree) links.

degree_df <-
  data.frame(page_id = V(g)$name,
             indegree = degree(g, V(g), 'in'),
             outdegree = degree(g, V(g), 'out'),
             stringsAsFactors = FALSE)

Based on the second rule listed above we should drop all pages with fewer than 5 incoming or outgoing links. We store the list of page_ids in the character vector corpus_ids.

corpus_ids <- subset(degree_df, indegree >= 5 & outdegree >= 5)$page_id
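The third rule (word-to-link ratio) can be sketched along the same lines. The following is only an illustration in base R on two made-up pages: the helper countMatches() and the column names are hypothetical, and here words are counted on the raw wikitext rather than on clean_text.

```r
# Made-up wikitext: an ordinary article and a list-like page
pages <- data.frame(
  page_title = c("article", "list"),
  old_text = c(paste("Poverty is the state of lacking resources.",
                     "It is studied in [[economics]] and [[sociology]].",
                     paste(rep("word", 40), collapse = " ")),
               "[[George Washington]] [[John Adams]] [[Thomas Jefferson]] [[James Madison]]"),
  stringsAsFactors = FALSE)

# Hypothetical helper: number of matches of a pattern in each string
countMatches <- function(pattern, x) {
  sapply(gregexpr(pattern, x, perl = TRUE), function(m) sum(m > 0))
}
pages$link_count <- countMatches("\\[\\[.*?\\]\\]", pages$old_text)
pages$word_count <- countMatches("\\w+", pages$old_text)

# Word-to-link ratio; pages without links get Inf and are therefore kept
pages$word_link_ratio <- pages$word_count / pages$link_count
kept <- pages[pages$word_link_ratio >= 15, ]
print(kept$page_title)
```

The list-like page has only two words per link and is dropped, while the ordinary article is kept.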

Optionally, if we selected a subset of articles we are interested in by targeting specific categories, we can further reduce the number of pages we will consider in the concept analysis with

corpus_ids <-
  corpus_ids[corpus_ids %in% as.character(selected_articles$name)]

The articles will now be treated as bags of words and jointly analysed to construct a [term frequency–inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (tf-idf) matrix. That is, the position of individual terms within each document is disregarded and each document is represented by a vector of weights (or scores) of length equal to the number of terms found in the entire corpus of documents. The weight assigned to each term is a function of the number of times the term is found in the document (tf) and, inversely, of the number of documents in the corpus where the term is found (idf), so that terms that appear in only a few documents will be assigned a relatively higher score than terms that appear in most of the documents of the corpus (for more details see Manning, Raghavan, and Schütze 2008, ch. 6).

We create a data.table with only the documents we want to include in the analysis

concept_corpus <- wiki_pages[page_id %in% corpus_ids, .(page_id, page_title, clean_text)]

and we process the corpus of documents (which was already cleaned) by removing stop words and stemming the remaining terms before computing a tf-idf matrix.

require(tm)
require(SnowballC)
tm_corpus <- Corpus(VectorSource(concept_corpus$clean_text))

tm_corpus <- tm_map(tm_corpus, removeWords, stopwords("italian"), lazy=TRUE)
tm_corpus <- tm_map(tm_corpus, stemDocument, language="italian", lazy = TRUE)
wikipedia_tdm <- TermDocumentMatrix(tm_corpus, control = list(weighting = function(x) weightTfIdf(x, normalize = TRUE)))

wikipedia_tdm is our end product. It is a term–document matrix where the rows represent the terms present in the corpus and the columns the documents. The cells of the matrix represent the weights of each term–document pair. With wikipedia_tdm we can represent any other term–document matrix computed from another corpus in terms of the Wikipedia wikipedia_tdm, that is, we can map the position of a corpus of documents in the concept space defined by the Wikipedia articles. We do this with a simple matrix operation.
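The tf-idf weighting behind wikipedia_tdm can be illustrated by hand on a tiny corpus. This is a base R sketch on three made-up, already stemmed documents; it assumes term frequencies normalised by document length and a base-2 logarithm for the idf, which may differ in detail from what a given package computes.

```r
# Three made-up, already cleaned and stemmed documents
docs <- c(d1 = "povert reddit povert", d2 = "reddit cittadin", d3 = "reddit auto motor")
tokens <- strsplit(docs, " ")
terms <- sort(unique(unlist(tokens)))

# Term-document matrix of raw counts: rows are terms, columns are documents
counts <- sapply(tokens, function(tok) as.vector(table(factor(tok, levels = terms))))
dimnames(counts) <- list(terms, names(docs))

# tf: term frequency normalised by document length
tf <- sweep(counts, 2, colSums(counts), "/")
# idf: log2 of (number of documents / number of documents containing the term)
idf <- log2(ncol(counts) / rowSums(counts > 0))
tfidf <- tf * idf
print(round(tfidf, 3))
```

The term that appears in every document receives a weight of zero everywhere, while a term concentrated in a single document receives the highest weight there.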

Let forum_tdm be a tf-idf term–document matrix created from a corpus of online comments on a forum. We extract all the terms from forum_tdm and intersect them with the terms of wikipedia_tdm (to simplify the computation we drop all terms that do not appear in forum_tdm),

```r
# Terms shared by the forum corpus and the Wikipedia corpus
w <- rownames(forum_tdm)
w <- w[w %in% rownames(wikipedia_tdm)]
wikipedia_tdm_subset <- wikipedia_tdm[w,]
```

and finally we obtain a concept_matrix with

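```r
require(slam)
# Cross product over the shared terms:
# rows are forum documents, columns are Wikipedia concepts
concept_matrix <-
  crossprod_simple_triplet_matrix(forum_tdm[w,],
                                  wikipedia_tdm_subset[])

colnames(concept_matrix) <- concept_corpus$page_id
# The rows correspond to the documents, i.e. the columns of forum_tdm
rownames(concept_matrix) <- colnames(forum_tdm)
```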

The concept_matrix assigns a score to each comment–concept pair. It is then possible to interpret each comment from the forum_tdm in terms of the scores of its concepts. Specifically, an insight into the meaning of a comment might be derived from the 10 to 20 concepts that received the highest scores.
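The mapping into the concept space can be sketched in a few lines of Python. This is only an illustration under assumptions: documents and concepts are represented as plain dictionaries of tf-idf weights, and the concept names, weights and the helper names `concept_scores` and `top_concepts` are all hypothetical.

```python
def concept_scores(doc_weights, concept_weights):
    """Score one document against every concept.

    The score is the sum, over the terms shared by document and concept,
    of the product of their weights (one cell of the cross product).
    """
    scores = {}
    for concept, weights in concept_weights.items():
        shared_terms = doc_weights.keys() & weights.keys()
        scores[concept] = sum(doc_weights[t] * weights[t] for t in shared_terms)
    return scores


def top_concepts(scores, k=10):
    """Return the k concepts with the highest score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical tf-idf weights for three Wikipedia concepts and one comment.
toy_concepts = {
    "Reddito di base": {"reddit": 0.8, "cittadin": 0.5},
    "Scuola": {"scuol": 0.9, "insegn": 0.4},
    "Lavoro": {"lavor": 0.7, "reddit": 0.2},
}
toy_comment = {"reddit": 0.6, "lavor": 0.3}

toy_scores = concept_scores(toy_comment, toy_concepts)
best_two = top_concepts(toy_scores, k=2)
```

Concepts that share no term with the comment score zero, which is why restricting both matrices to the shared terms loses nothing.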

## Step 6: Visualising a discussion in 2D

By locating each document of a corpus of interest within a concept space we can quantify the ‘distance’ between each pair of documents. Of course the concept space is a multidimensional space where, instead of the three axes of the space we experience around us (width, height and depth), we have an axis for each concept of the concept map (that is, potentially hundreds of thousands of axes). Nevertheless there exist many mathematical techniques for dimensionality reduction that in practical terms can bring the number of dimensions down to two or three, thus opening the way to visualisation. I detail here how to use a technique called t-SNE (Van der Maaten and Hinton 2008) to visualise about 4,000 documents (blog posts, comments and parliamentary bills) discussing the introduction of a citizen’s income in Italy, based on the concept space we computed before.

First we need to calculate the cosine distance matrix from our concept_matrix. Recall that our concept_matrix stores the weights that were assigned to each document–concept pair. If we think about it in spatial terms, for each document the concept_matrix tells us its relative distance from each concept. But what we want to visualise is the distance separating each pair of documents, that is, we need a document–document matrix. The transformation is performed by calculating the cosine similarity between the documents of the concept_matrix and subtracting it from 1. We first transpose the concept_matrix with t() and then calculate a cosine distance matrix with the package slam.

```r
# Cosine distance between each pair of documents
require(slam)
concept_matrix <- as.simple_triplet_matrix(t(concept_matrix))
cosine_dist_mat <-
  1 - crossprod_simple_triplet_matrix(concept_matrix) /
  (sqrt(col_sums(concept_matrix^2) %*% t(col_sums(concept_matrix^2))))
```
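The same computation can be written out explicitly in plain Python, which may help to see what the matrix expression does. This is a sketch on a hypothetical toy matrix (rows are documents, columns are concept scores) and it assumes that no document has an all-zero row, since the norm in the denominator would then be zero.

```python
import math


def cosine_distance_matrix(rows):
    """Pairwise cosine distance (1 - cosine similarity) between row vectors.

    Assumes every row has at least one non-zero entry.
    """
    norms = [math.sqrt(sum(v * v for v in row)) for row in rows]
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(rows[i], rows[j]))
            dist[i][j] = 1.0 - dot / (norms[i] * norms[j])
    return dist


# Hypothetical concept scores for three documents over two concepts.
toy_documents = [
    [1.0, 0.0],  # only about the first concept
    [2.0, 0.0],  # same direction, longer vector
    [0.0, 3.0],  # only about the second concept
]
toy_dist = cosine_distance_matrix(toy_documents)
```

Documents pointing in the same direction of the concept space have distance 0 regardless of their length, while documents with no concept in common have distance 1.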

Finally, with the package tsne we fit our data to produce a matrix of two columns with x and y coordinates to plot each document as a dot on a 2D plane.

```r
require(tsne)
fit <- tsne(cosine_dist_mat, max_iter = 1000)
```

This is the result rendered with ggplot2:

The figure is part of my research on online deliberation and Italy’s Five Star Movement (M5S). In the top panel I color-coded the documents based on five macro concepts (which were used to identify each document) and marked the bill that was presented in Parliament (also a document in my corpus) with a triangle. In the second row of plots from the top I map the 2D kernel density of the documents belonging to each macro concept, and in the third and fourth rows the temporal evolution of the discussion on two platforms (a forum and a blog).

# References and R packages

Auguie, Baptiste. 2015. GridExtra: Miscellaneous Functions for “Grid” Graphics. https://CRAN.R-project.org/package=gridExtra.

Bouchet-Valat, Milan. 2014. SnowballC: Snowball Stemmers Based on the c Libstemmer UTF-8 Library. https://CRAN.R-project.org/package=SnowballC.

Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. http://igraph.org.

Databases, R Special Interest Group on. 2014. DBI: R Database Interface. https://CRAN.R-project.org/package=DBI.

Donaldson, Justin. 2012. Tsne: T-Distributed Stochastic Neighbor Embedding for R (T-SNE). https://CRAN.R-project.org/package=tsne.

Dowle, M, A Srinivasan, T Short, S Lianoglou with contributions from R Saporta, and E Antonyan. 2015. Data.table: Extension of Data.frame. https://CRAN.R-project.org/package=data.table.

Feinerer, Ingo, Kurt Hornik, and David Meyer. 2008. “Text Mining Infrastructure in R.” Journal of Statistical Software 25 (5): 1–54. http://www.jstatsoft.org/v25/i05/.

Gabrilovich, Evgeniy, and Shaul Markovitch. 2007. “Computing Semantic Relatedness Using Wikipedia-Based Explicit Semantic Analysis.” In IJCAI, 7:1606–11. http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-259.pdf.

Hornik, Kurt, David Meyer, and Christian Buchta. 2014. Slam: Sparse Lightweight Arrays and Matrices. https://CRAN.R-project.org/package=slam.

Jarosciak, Jozef. 2013. “How to Import Entire Wikipedia into Your Own MySQL Database.” Joe0.com. http://www.joe0.com/2013/09/30/how-to-create-mysql-database-out-of-wikipedia-xml-dump-enwiki-latest-pages-articles-multistream-xml/.

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. New York, NY: Cambridge University Press.

Neuwirth, Erich. 2014. RColorBrewer: ColorBrewer Palettes. https://CRAN.R-project.org/package=RColorBrewer.

Ooms, Jeroen, David James, Saikat DebRoy, Hadley Wickham, and Jeffrey Horner. 2016. RMySQL: Database Interface and ’MySQL’ Driver for R. https://CRAN.R-project.org/package=RMySQL.

R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Van der Maaten, Laurens, and Geoffrey Hinton. 2008. “Visualizing Data Using T-SNE.” Journal of Machine Learning Research 9: 2579–2605. http://siplab.tudelft.nl/sites/default/files/vandermaaten08a.pdf.

Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://had.co.nz/ggplot2/book.

———. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (1): 1–29. http://www.jstatsoft.org/v40/i01/.

———. 2015. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.

Wickham, Hadley, and Romain Francois. 2015. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

1. grepl("[A-Za-z]:[A-Za-z]",x []

Tuesday, 26 April 2016

## Italy’s Five Star Movement – a spectral analysis of its political composition

To talk about identity and soul of the Five Star Movement (M5S) is not only politically contentious but also practically challenging because of the different axes (at least three) along which the M5S has been developing: the vertical top-down axis from Beppe Grillo to his followers (and sympathising voters), the horizontal axis connecting thousands of militants across the country to local, flexible and loosely organised meetups, and finally the cloudy axis linking Internet users through the different online communicative platforms pertaining to the Movement. The academic literature and the media have been prevalently interested in mapping the provenance of votes. I will try here to show some data also on the position of the M5S derived from its 2013 electoral program and the political background of both the onsite and online activists of the Movement.

But let’s first start briefly introducing the trajectory of a movement that vehemently refuses to be called a party or to be associated with any traditional political identity.

Continue reading on the blog of the WZB.

Tuesday, 12 May 2015

## NDVI, risk assessment and developing countries

The Normalized Difference Vegetation Index (NDVI) estimates the greenness of plants covering the surface of the Earth by measuring the light reflected by the vegetation into space. The main idea behind the NDVI is that visible and near-infrared light is absorbed in different proportions by healthy and unhealthy plants: a green plant will reflect 50% of the near-infrared light it receives and only 8% of the visible light, while an unhealthy plant will reflect 40% and 30% respectively. NDVI can then be used to quantitatively compare vegetation conditions across time and space (and indeed it is quite widely used: a Google Scholar search on NDVI produced 60,500 hits).
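The index itself is the standard normalised difference between the two reflectances, (NIR - VIS) / (NIR + VIS). A minimal Python sketch applied to the two examples above (the function name `ndvi` is mine):

```python
def ndvi(near_infrared, visible):
    """Normalised difference between near-infrared and visible reflectance."""
    return (near_infrared - visible) / (near_infrared + visible)


# Reflectance fractions taken from the two examples in the text.
healthy = ndvi(0.50, 0.08)    # about 0.72
unhealthy = ndvi(0.40, 0.30)  # about 0.14
```

The index ranges from -1 to 1; the satellite products used below store it multiplied by 10,000.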

I recently used the NDVI in the analysis of the data produced by a survey of 500 Afghan farmers. The survey aimed to assess the benefits, in terms of production, of using improved seeds rather than local seeds. The problem in comparing yields in different parts of the country is that (of course) environmental conditions do affect production and thus partially explain differences in output. My idea was then to test whether it was possible to determine how much of the difference was explained by the seed variety and how much by the environmental conditions as captured by the NDVI. My regression analysis did indeed show a significant correlation not only between the variety adopted and output but also between the NDVI and output.

The next idea was then to explore how the NDVI could be used to quantitatively assess environmental risks in developing countries. What I did was to assess how many people were affected by the 2008 drought in Afghanistan, and I did it on a zero budget, using only free software and publicly available datasets.

1) The NDVI satellite imagery available for Afghanistan covers a period of 13 years, so it is possible to compare the NDVI values for 2008 with the average value over the period 2000-2007. This difference is defined as the NDVI anomaly and can be used as an indicator of drought. Different products are available for download from the portal of the U.S. Geological Survey; they differ in resolution (250m to 1000m) and temporal granularity (8-day to yearly). For my experiment I chose to use one raster (MYD13A1) per year, based on imagery taken on the same day every year (the 13th of August). It is also possible to calculate the average using more than one raster per year, so as to account for variations in rainfall patterns from one year to the next, but in any case the period used to calculate the anomaly should coincide with the plants’ growing season in the region under study.

2) To calculate the anomaly I used the open source Quantum GIS (QGIS) and the plugin RasterCalc. The formula I applied is quite straightforward: 2008 – (2000 + 2001 + 2002 + 2003 + 2004 + 2005 + 2006 + 2007)/8. A raster file contains one value, in this case the NDVI, for each pixel (each pixel represents a 500m square in the real world); the value of each 2008 pixel is then compared with the average value of the same pixel over the preceding years (this is of course possible because the images overlap precisely). This is the result:

NDVI anomaly for August 2008 based on the mean of the NDVI values of the 8 previous years

The brownish regions represent a negative anomaly (indicating vegetation less healthy than the average) while the greenish regions represent a positive anomaly. Darker colours indicate a stronger anomaly in both directions. To put these colours into context I ran the same analysis (with the same type of imagery) to photograph the exceptional drought that hit the United States in 2012. With the same colour scale as above, here is the result:

NDVI anomaly for August 2012 based on the mean of the NDVI values of the 12 previous years

According to the NDVI values, the 2012 drought in the US was stronger than that in 2008 in Afghanistan.

Having prepared a raster with the NDVI anomaly values, I had to quantify the population affected. To estimate it I decided to use the number of people living in the proximity of areas with significant negative NDVI anomalies. The first step was to decide what counts as a significant anomaly: I set the threshold at a value of -200 or lower (that is, because the NDVI is expressed on a scale from 0 to 10,000, a difference of -2% from the average). Arguably it is possible to fix the threshold at a lower level (like -10%), but it really depends on how much one considers the average of the years considered to represent healthy vegetation.
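The anomaly and the threshold can be illustrated with a small Python sketch working on nested lists instead of real rasters. The pixel values are hypothetical, and only two baseline years are used for brevity where the analysis above uses eight.

```python
def ndvi_anomaly(target, baseline_years):
    """Per-pixel difference between a raster and the mean of the baseline rasters."""
    n_years = len(baseline_years)
    return [
        [
            target[r][c] - sum(year[r][c] for year in baseline_years) / n_years
            for c in range(len(target[0]))
        ]
        for r in range(len(target))
    ]


THRESHOLD = -200  # -2% on the 0 to 10,000 NDVI scale

# Hypothetical 2x2 rasters: two baseline years and the year under study.
baseline = [
    [[5000, 3000], [4000, 2000]],
    [[5200, 3100], [4200, 2000]],
]
raster_2008 = [[4600, 3050], [4000, 1500]]

anomaly = ndvi_anomaly(raster_2008, baseline)
# True where the anomaly is at or below the threshold.
drought = [[value <= THRESHOLD for value in row] for row in anomaly]
```

Moving the threshold to -1000 in this toy example would leave no pixel flagged, which mirrors how sensitive the final population estimate is to this choice.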

3) The population data was downloaded from the HumanitarianResponse website, which collects datasets from different sources. The website is run by the United Nations Office for the Coordination of Humanitarian Affairs (UNOCHA). The dataset I downloaded contained information (including location) on 17,075,000 people, 2,731,000 households and 37,767 villages. These figures are reasonably close to the latest estimate of the Afghan government for the total rural population (about 19 million people). Unfortunately neither the source of the dataset (possibly the Central Statistics Organisation of Afghanistan, CSO) nor the date when the data was collected is indicated. Nevertheless I cross-checked the dataset against the population estimates for different provinces published on the CSO website and they seem close enough. So let’s take it as a good approximation of reality in rural Afghanistan in 2008.

Now all the cards are on the table: we have figures and geographic locations for both population and NDVI anomaly. We need to intersect the data.

4) The data comes in two different formats. Population is in a vector layer, that is, a table with each row representing one point on the map, while the NDVI anomaly values come in a raster layer with each pixel representing a 500m square. To understand which villages were at risk of drought in 2008 we need to calculate the mean anomaly for the area surrounding each village. We do this with QGIS using the “Buffer” tool: we draw a circular polygon around each village with a radius of 5km (to be exact, 0.05 degrees).

5) We are now ready to intersect the new polygons with the raster containing the anomaly. This time we use the “Zonal statistics” tool of QGIS, which gives us for each polygon the mean value of the pixels contained in the corresponding geographic region on the raster layer. In other words, we obtain the mean anomaly in the area surrounding each village.
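What the buffer and the zonal statistics compute together can be sketched in Python as the mean of the pixels around a point. This is a simplification under assumptions, not a description of how QGIS works internally: a pixel counts as inside the buffer when its centre is, the grid is square and north-up, and the raster values and the function name `zonal_mean` are hypothetical.

```python
import math


def zonal_mean(raster, origin, pixel_size, centre, radius):
    """Mean of the pixels whose centres lie within `radius` of `centre`.

    raster     : list of rows, first row northernmost
    origin     : (x, y) coordinates of the top-left corner of the raster
    pixel_size : side of a pixel, in the same units as the coordinates
    Returns None when no pixel centre falls inside the buffer.
    """
    x_min, y_max = origin
    centre_x, centre_y = centre
    values = []
    for r, row in enumerate(raster):
        for c, value in enumerate(row):
            pixel_x = x_min + (c + 0.5) * pixel_size
            pixel_y = y_max - (r + 0.5) * pixel_size
            if math.hypot(pixel_x - centre_x, pixel_y - centre_y) <= radius:
                values.append(value)
    return sum(values) / len(values) if values else None


# Hypothetical 3x3 anomaly raster with unit pixels and a village in the middle.
toy_raster = [
    [-100, -300, -100],
    [-300, -500, -300],
    [-100, -300, -100],
]
village = (1.5, 1.5)
mean_anomaly = zonal_mean(toy_raster, origin=(0.0, 3.0), pixel_size=1.0,
                          centre=village, radius=1.0)
```

With a radius of one pixel the buffer covers the central pixel and its four direct neighbours, so the village is assigned the mean of those five values.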

6) The final step is to query the vector layer to identify the villages where the mean anomaly is less than or equal to -200, our threshold level, and sum the population of those villages. Because the table containing the statistics on villages is pretty big (37,000 rows and about 200 columns) we want to use a SQL-based database management system (I used the open source Postgres) to perform the following SQL queries:

1. SELECT SUM(population) FROM vector_layer WHERE mean <= -200; (for the sum of the population affected)
2. SELECT (SELECT SUM(population) FROM vector_layer WHERE mean <= -200) / SUM(population) FROM vector_layer; (for the proportion of the population affected over the total population)
3. SELECT t1.province, pop_risk, pop_tot, pop_risk/pop_tot AS perc FROM (SELECT province, SUM(population) AS pop_risk FROM vector_layer WHERE mean <= -200 GROUP BY province) AS t1 JOIN (SELECT province, SUM(population) AS pop_tot FROM vector_layer GROUP BY province) AS t2 ON t1.province = t2.province ORDER BY perc DESC; (for a table with the total population affected as “pop_risk” and its percentage over the total population as “perc” in the affected provinces)
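The three queries can be tried on a toy table without installing a database server. The sketch below uses Python's built-in sqlite3 module instead of Postgres; the five villages, the province names and the choice of storing population as a real number (so that the divisions are not truncated to integers) are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vector_layer (province TEXT, population REAL, mean REAL)"
)
# Hypothetical villages: province, population, mean NDVI anomaly in the buffer.
conn.executemany(
    "INSERT INTO vector_layer VALUES (?, ?, ?)",
    [
        ("A", 1000, -350.0),
        ("A", 3000, -50.0),
        ("B", 2000, -250.0),
        ("B", 2000, -220.0),
        ("C", 2000, 120.0),
    ],
)

# Query 1: total population in villages at or below the threshold.
affected = conn.execute(
    "SELECT SUM(population) FROM vector_layer WHERE mean <= -200"
).fetchone()[0]

# Query 2: share of the total population that is affected.
share = conn.execute(
    "SELECT (SELECT SUM(population) FROM vector_layer WHERE mean <= -200)"
    " / SUM(population) FROM vector_layer"
).fetchone()[0]

# Query 3: breakdown by province, most affected first.
by_province = conn.execute(
    """
    SELECT t1.province, pop_risk, pop_tot, pop_risk / pop_tot AS perc
    FROM (SELECT province, SUM(population) AS pop_risk
          FROM vector_layer WHERE mean <= -200 GROUP BY province) AS t1
    JOIN (SELECT province, SUM(population) AS pop_tot
          FROM vector_layer GROUP BY province) AS t2
      ON t1.province = t2.province
    ORDER BY perc DESC
    """
).fetchall()
conn.close()
```

Note that a province with no village below the threshold (province C here) does not appear in the third result at all, because the inner join only keeps provinces present in both subqueries.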

These are the results:

1. 4,407,538 people affected;
2. 25.81% of the total rural population affected;
3. By province:

| province | pop_risk | pop_tot | perc |
|---|---:|---:|---:|
| Badakhshan | 738,691 | 753,823 | 97.99% |
| Baghlan | 562,410 | 580,518 | 96.88% |
| Samangan | 235,570 | 260,066 | 90.58% |
| Faryab | 592,039 | 700,001 | 84.58% |
| Sari Pul | 341,378 | 409,019 | 83.46% |
| Balkh | 516,495 | 678,190 | 76.16% |
| Jawzjan | 225,552 | 309,529 | 72.87% |
| Takhar | 433,794 | 667,089 | 65.03% |
| Kunduz | 323,966 | 508,113 | 63.76% |
| Panjsher | 49,212 | 95,716 | 51.41% |
| Badghis | 177,413 | 499,523 | 35.52% |
| Bamyan | 47,711 | 329,084 | 14.50% |
| Paktya | 26,943 | 468,371 | 5.75% |
| Khost | 34,640 | 640,945 | 5.40% |
| Hirat | 69,908 | 1,293,924 | 5.40% |
| Parwan | 11,586 | 421,690 | 2.75% |
| Maydan Wardak | 9,513 | 509,320 | 1.87% |
| Nimroz | 817 | 92,747 | 0.88% |
| Ghor | 4,537 | 606,504 | 0.75% |
| Paktika | 5,363 | 794,594 | 0.67% |

Finally, with the same queries, let’s try to lower the threshold to -1000, that is a -10% difference from the mean NDVI over the previous 8 years:

1. 33,368 people affected;
2. 0.2% of the total rural population affected;
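3. By province and district:

| province | district | pop_risk | pop_tot | perc |
|---|---|---:|---:|---:|
| Kunduz | Kunduz | 28,089 | 97,457 | 28.82% |
| Badakhshan | Khwahan | 3,660 | 14,263 | 25.66% |
| Kunduz | Khanabad | 1,572 | 99,222 | 1.58% |
| Kunduz | Chahar Dara | 47 | 37,669 | 0.12% |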

Thursday, 14 February 2013

## Eyes on Guatemala

The Economist has published an article on malnutrition in Guatemala. Hunger is not new in the country: with half of its children not eating enough, Guatemala is the sixth-worst country in the world, and in some Maya communities chronic child malnutrition can reach 75% (the Economist says 80%). These figures are astonishing, especially because the problem is not food scarcity.

But this, too, is hardly new. It was 1981 when Amartya Sen published his Poverty and Famines: An Essay on Entitlement and Deprivation, demonstrating that hunger is mostly caused by inequality rather than scarcity. There is no lack of food in Guatemala if you have the money to buy it. The 14th Festival Gastronómico Internacional is taking place in Guatemala City as we speak, so it seems difficult to talk about a famine or about an emergency (according to the Longman Dictionary an emergency is “an unexpected and dangerous situation that must be dealt with immediately”). The problem is the lack of a functioning state, because a state cannot function with tax revenues estimated at just 10% of GDP.

Democracy is highly unrepresentative in Guatemala. Those who should push for a better redistribution of resources have no voice. National newspapers constantly point the finger at the government (presidency, parliament, judiciary) in an impressive campaign of delegitimation. The Rosenberg tape was just part of it. I’m not defending the government, but criticising it and attempting to systematically destroy its credibility are not quite the same thing. While the headlines cover crime, corruption and hunger, the real battle within the country is over tax reform, a battle that so far every government has badly lost.

Friday, 28 August 2009

## Back into Poverty

The increase in food prices has pushed at least 100 million people back into poverty in 2008 and, according to the United Nations Standing Committee on Nutrition (here, p. 60),

erase at least four years of progress towards the Millennium Development Goal (MDG) 1 target for the reduction of poverty. The household level consequences of this crisis are most acutely felt in LIFDCs [Low-Income Food-Deficit Countries] where a 50% rise in staple food prices causes a 21% increase in total food expenditure, increasing these from 50 to 60% of income. In a high income country this rise in prices causes a 6% rise in retail food expenditure with income expenditure on food rising from 10 to 11%. FAO estimates that food price rises have resulted in at least 50 million more people becoming hungry in 2008, going back to the 1970 figures.

According to the World Bank (here) this means that between 200,000 and 400,000 more children will die every year from malnutrition until 2015.

Thursday, 18 June 2009

## Selva Amazónica, More Valuable Standing Than Felled

An article published in Science this week analyzes the development of the region along the Amazon deforestation frontier. In three words: boom and bust. Comparing the Human Development Index (HDI) of different classes of Brazilian municipalities, from prefrontier municipalities to heavily deforested postfrontier municipalities, you can see that the HDI grows in relative terms in the first phase of deforestation (on the frontier line) and declines in relative terms once deforestation is complete. In other words,

when the median HDI of each class is plotted against deforestation extent, a boom-and-bust pattern becomes apparent, which suggests that relative development levels increase rapidly in the early stages of deforestation and then decline as the frontier advances. Hence, although municipalities with active deforestation had development levels that approached the overall Brazilian median, pre- and postfrontier HDI values were substantially lower and statistically indistinguishable from each other (P > 0.9). These results are robust to the particular thresholds used to define the frontier classes. A boom-and-bust pattern is also found for each of the HDI subindices: standard of living, literacy, and life expectancy.

This strongly suggests that the poor have no choice but to exploit every resource available. It is difficult to think that farmers or loggers do not see that they are compromising their very own future. They simply have no choice. The challenge is giving them a choice.

Friday, 12 June 2009

## On the Evolution of Thinking

What if we are becoming the very same Artificial Intelligence that we are trying to design? The doubt has been raised by Nicholas Carr in an article published one year ago in The Atlantic and now published in Le Monde. The theory is intriguing and the discourse goes, in the words of developmental psychologist Maryanne Wolf, more or less in this direction:

We are not only what we read. We are how we read.

So, learning directly from the voice of Socrates is not the same as learning from the Internet. The way we approach new ideas and knowledge influences how we assimilate them and how we develop our thinking. The risk is that our mind might find the effectiveness of Google’s algorithm so attractive that it tries to replicate it, forgetting all the ambiguity that has made us what we are. What we are so far.

Update: Have a look at this article in Le Monde about the influence of the new information technologies on culture.

Friday, 5 June 2009

## If the Asian Growth Model is not Working Anymore

In 1981 the poverty rate in China was 64% of the population; in 2004 it was 10%. This means that 500 million people stepped out of poverty (look here and here). The economies of China and South-East Asia were propelled by export demand and by someone else’s debt. What now? In the words of FT columnist Michael Pettis,

The assumption that implicitly underlay the Asian development model – that US households had an infinite ability to borrow and spend – has been shown to be false. This spells the end of this model as an engine of growth.

It seems like bad news for economists pointing at free trade and export-led growth as a practical recipe for development. It seems like bad news for everybody. People in developing countries need to increase their income, and it is difficult to think how they could find the money in their neighborhoods.

Tuesday, 19 May 2009