Explicit semantic analysis with R

Explicit semantic analysis (ESA) was proposed by Gabrilovich and Markovitch (2007) to compute a document's position in a high-dimensional concept space. At its core, the technique compares the terms of the input document with the terms of the documents describing each concept, and so estimates the relatedness of the input document to each concept. In spatial terms, if I know the distance of the input document from meaningful concepts (e.g. ‘car’, ‘Leonardo da Vinci’, ‘poverty’, ‘electricity’), I can infer the meaning of the document relative to those explicitly defined concepts from its position in the concept space.
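To make the idea concrete, here is a minimal sketch in base R. It is not the original implementation: the three concept descriptions are made-up toy strings standing in for real concept documents, the helper names (`tokenize`, `count_terms`, `esa_vector`) are hypothetical, and I assume TF-IDF weights compared by cosine similarity as the relatedness measure.

```r
# Toy concept descriptions (assumption: invented stand-ins for real concept documents).
concepts <- c(
  car         = "car engine wheel road driver vehicle engine",
  electricity = "electricity current voltage power grid wire",
  poverty     = "poverty income poor welfare inequality income"
)

# Split lower-cased text into word tokens (hypothetical helper).
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens[nzchar(tokens)]
}

# Count tokens over a fixed vocabulary; unknown tokens are dropped.
count_terms <- function(tokens, vocab) {
  as.numeric(table(factor(tokens, levels = vocab)))
}

concept_tokens <- lapply(concepts, tokenize)
vocab <- sort(unique(unlist(concept_tokens)))

# Term-by-concept matrix of raw term frequencies.
tf <- sapply(concept_tokens, count_terms, vocab = vocab)
rownames(tf) <- vocab

# Inverse document frequency: terms found in fewer concepts weigh more.
idf <- log(ncol(tf) / rowSums(tf > 0))
weights <- tf * idf

# Position of a document in the concept space:
# one cosine similarity per concept.
esa_vector <- function(document) {
  doc_w  <- count_terms(tokenize(document), vocab) * idf
  scores <- as.numeric(crossprod(weights, doc_w))
  norms  <- sqrt(colSums(weights^2)) * sqrt(sum(doc_w^2))
  out    <- ifelse(norms > 0, scores / norms, 0)
  names(out) <- colnames(weights)
  out
}

relatedness <- esa_vector("The driver started the engine and the car left the road")
print(round(relatedness, 3))
```

The returned vector has one coordinate per concept, so the input document here sits close to ‘car’ and at zero relatedness to the other two concepts. With a real concept collection the same steps apply, only with a much larger term-by-concept matrix.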


Tuesday, 26 April 2016


Twitter: frbailo


