Word2vec
Visualising Word2Vec Word Vectors
Rebecca Barter
A speedy introduction to Word2Vec
To blatantly quote the Wikipedia article on Word2Vec:
Word2Vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2Vec takes as its input a large corpus of text and produces a high-dimensional space (typically of several hundred dimensions), with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space
Basically, Word2Vec was developed by a bunch of clever Googlers as a way to represent words as high-dimensional vectors such that the relative positions of these vectors in space are meaningful in terms of linguistic context. Somehow, this embedding gives rise to algorithmic magic in which equations such as
king - queen = man - woman
and its logical extension,
king - queen + woman = man
make sense.
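To make the arithmetic concrete, here is a minimal sketch using made-up two-dimensional vectors. The `toy` matrix and its values are invented purely for illustration; real Word2Vec vectors have hundreds of dimensions, and the analogy only holds approximately there.

```r
# Toy 2-d "embeddings" (hypothetical values, not real Word2Vec output).
# Column 1 loosely encodes "royalty", column 2 loosely encodes "gender".
toy <- rbind(
  king  = c(1,  1),
  queen = c(1, -1),
  man   = c(0,  1),
  woman = c(0, -1)
)

# king - queen + woman
target <- toy["king", ] - toy["queen", ] + toy["woman", ]

# find the word whose vector is closest (Euclidean distance) to the target
distances <- apply(toy, 1, function(v) sqrt(sum((v - target)^2)))
nearest <- names(which.min(distances))
print(nearest)  # "man"
```

In this toy setting the result lands exactly on the vector for "man"; with real embeddings one would look for the nearest neighbour instead of an exact match.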
Most techniques for visualising relationships between high-dimensional word vectors rely on dimensionality reduction, such as projection onto principal components or t-SNE. Some of these visualisations are actually pretty cool.
This case study introduces an alternative approach to visualising the relationship between word vectors in R that does not rely on dimensionality reduction. Welcome to the beautiful world of superheatmaps!
Preparing the data
# load in some useful libraries
library(knitr)
library(dplyr)
library(reshape2)
library(cluster)
library(ggplot2)
The Google News Word2Vec data can be found here. A long time ago, I used the convert_Word2Vec.py Python script to convert it to a .csv file, then read it into R and saved it as an .RData file. This script was adapted from Franck Dernoncourt’s response to this Stack Overflow post.
Identifying the most common words from the NY Times headlines
First, we want to limit our data to some list of common words. Specifically, we will identify the most common words from New York Times headlines compiled by Professor Amber E. Boydstun at UC Davis. The headlines data can be obtained from the RTextTools R package. Having identified these common words, we will then filter our GoogleNews Word2Vec dataset to these common words.
library(RTextTools)
# The NYTimes dataset can be extracted from the RTextTools package
data(NYTimes)
# view the first 3 rows
kable(head(NYTimes, 3))
Article_ID | Date | Title | Subject | Topic.Code |
---|---|---|---|---|
41246 | 1-Jan-96 | Nation’s Smaller Jails Struggle To Cope With Surge in Inmates | Jails overwhelmed with hardened criminals | 12 |
41257 | 2-Jan-96 | FEDERAL IMPASSE SADDLING STATES WITH INDECISION | Federal budget impasse affect on states | 20 |
41268 | 3-Jan-96 | Long, Costly Prelude Does Little To Alter Plot of Presidential Race | Contenders for 1996 Presedential elections | 20 |
In preparation for calculating word frequencies, we first restrict to the headlines, and will then do some pre-processing using the tm package.
library(tm)
# Extract the title from each article entry
nyt.headlines <- as.character(NYTimes$Title)
# convert the headline titles to a tm corpus
nyt.corpus <- Corpus(VectorSource(nyt.headlines))
The pre-processing steps involve converting all words to lowercase, removing punctuation, numbers and extra whitespace, and removing stop-words like “a”, “and”, and “the”.
# convert to lowercase
nyt.corpus <- tm_map(nyt.corpus, content_transformer(tolower))
# remove punctuation
nyt.corpus <- tm_map(nyt.corpus, removePunctuation)
# remove numbers
nyt.corpus <- tm_map(nyt.corpus, removeNumbers)
# remove whitespace
nyt.corpus <- tm_map(nyt.corpus, stripWhitespace)
# remove stop words
nyt.corpus <- tm_map(nyt.corpus, removeWords, stopwords("english"))
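To see what these steps do to a single headline, here is a rough base-R approximation. It is only a sketch: it does not call tm, and `toy.stopwords` is a hypothetical three-word stand-in for the much longer list returned by `stopwords("english")`.

```r
# One of the headlines shown above
headline <- "Long, Costly Prelude Does Little To Alter Plot of Presidential Race"

# hypothetical, tiny stop-word list (the real tm list is far longer)
toy.stopwords <- c("does", "to", "of")

# lowercase, then strip punctuation and digits
clean <- tolower(headline)
clean <- gsub("[[:punct:]]", "", clean)
clean <- gsub("[[:digit:]]", "", clean)

# split on whitespace and drop stop words and empty strings
words <- strsplit(clean, "[[:space:]]+")[[1]]
words <- words[!(words %in% toy.stopwords) & nzchar(words)]
print(words)
# "long" "costly" "prelude" "little" "alter" "plot" "presidential" "race"
```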
Next, to get the word counts, we can add up the columns of the document term matrix.
# convert to a dtm
dtm <- DocumentTermMatrix(nyt.corpus)
# calculate the word counts
freq <- colSums(as.matrix(dtm))
freq.df <- data.frame(word = names(freq), count = freq) %>%
  arrange(desc(count))
# view the 10 most frequent words
kable(head(freq.df, 10))
word | count |
---|---|
new | 205 |
bush | 115 |
iraq | 98 |
war | 81 |
overview | 80 |
nation | 67 |
campaign | 65 |
president | 61 |
plan | 59 |
report | 59 |
Apparently the articles in the dataset talk a lot about Bush, the Iraq war and the presidential campaigns!
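As a small illustration of why column sums give word counts, here is a sketch with a hand-made document term matrix. The `toy.dtm` values are invented: each row is a document, each column a term, and each cell the number of times the term occurs in the document.

```r
# hypothetical 3-document, 3-term matrix (values made up for illustration)
toy.dtm <- matrix(c(1, 0, 1,
                    0, 1, 1,
                    1, 0, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Docs = c("1", "2", "3"),
                                  Terms = c("bush", "iraq", "war")))

# summing each column gives the total count of each term across documents
toy.freq <- sort(colSums(toy.dtm), decreasing = TRUE)
print(toy.freq)
# war bush iraq
#   3    2    1
```

For a corpus of this size the dense `as.matrix()` conversion is fine. For much larger corpora it may use a lot of memory, in which case summing the sparse matrix directly (for example with `slam::col_sums()`) is probably worth a look.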
Next, we filter to the 1,000 most common words:
# obtain the 1000 most common words in the New York Times headlines
common.words <- names(sort(freq, decreasing = TRUE))[1:1000]
Obtaining and cleaning the Word2Vec Google News data
Next, we will load in the previously processed GoogleNews Word2Vec data, and filter to the most common words.
load("processed_data/GoogleNews.RData")
GoogleNews.common <- GoogleNews[tolower(GoogleNews[,1]) %in% common.words,]
Next we want to remove the first column corresponding to the words and place these in the row labels.
# the first column contains the words so we want to set the row names accordingly
rownames(GoogleNews.common) <- GoogleNews.common[,1]
# and then remove the first column
GoogleNews.common <- GoogleNews.common[,-1]
Note that many words are repeated. For example, the word “next” has 7 different vectors, one for each capitalization variant that appears in the data: “next”, “NEXT”, “neXt”, “NExt”, “NeXT”, “NeXt”, and “Next”.
We thus make the simplifying choice to remove all non-lowercase word vectors from our dataset.
# identify all lowercase words
lowercase.words <- rownames(GoogleNews.common) == tolower(rownames(GoogleNews.common))
# restrict to these lowercase words
GoogleNews.common <- GoogleNews.common[lowercase.words, ]
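The filter works because a word equals its own `tolower()` version only when it contains no uppercase letters. A tiny sketch with a made-up character vector (`toy.words` is hypothetical, not taken from the data):

```r
# hypothetical mix of capitalization variants
toy.words <- c("next", "NEXT", "Next", "war", "War")

# TRUE only for entries that are already entirely lowercase
is.lowercase <- toy.words == tolower(toy.words)
print(toy.words[is.lowercase])  # "next" "war"
```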
Introducing Superheat
Installing the superheat package from GitHub is easy using the devtools package. Simply run the following commands:
# install devtools if you don't have it already
install.packages("devtools")
# install the development version of superheat
devtools::install_github("rlbarter/superheat")
Assuming that you didn’t run into any unfortunate errors when installing the package, you can load the package in the normal way.
library(superheat)
Visualising cosine similarity for the 40 most common words
Direct visualisation of the raw word vectors themselves is quite uninformative, primarily due to the fact that the original Word2Vec dimensions are somewhat meaningless. Instead, we will visually compare the vectors using cosine similarity, a common similarity metric for Word2Vec data.
First, we restrict to the 40 most common words from the NY Times headlines that are also in the Google News corpus. This leaves us with 35 words, since the lowercase versions of 5 of the NY Times words do not appear in the Google News corpus: “nation”, “clinton”, “bill”, “back”, “set”.
# identify the 40 most common words from the NY Times headlines that are also in the Google News corpus
forty.most.commmon.words <- common.words[1:40][common.words[1:40] %in% rownames(GoogleNews.common)]
length(forty.most.commmon.words)
## [1] 35
Next, we need to write a function for computing the pairwise cosine similarity between the columns of a matrix. Although the function uses for-loops, it’s actually pretty quick.
CosineFun <- function(x, y) {
  # calculate the cosine similarity between two numeric vectors x and y
  sum(x * y) / (sqrt(sum(x * x)) * sqrt(sum(y * y)))
}
CosineSim <- function(X) {
  # calculate the pairwise cosine similarity between columns of the matrix X
  # initialize the similarity matrix (returned as a data frame)
  m <- matrix(NA,
              nrow = ncol(X),
              ncol = ncol(X),
              dimnames = list(colnames(X), colnames(X)))
  sim <- as.data.frame(m)
  # calculate the pairwise cosine similarity
  for (i in seq_len(ncol(X))) {
    for (j in i:ncol(X)) {
      # compare the full columns i and j
      sim[i, j] <- CosineFun(X[, i], X[, j])
      # fill in the symmetric entry on the other side of the diagonal
      sim[j, i] <- sim[i, j]
    }
  }
  return(sim)
}
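If the loops ever become a bottleneck, the same quantity can be computed with matrix algebra. The sketch below is an alternative, not the function used in this case study: `CosineSimFast` and `toy.X` are hypothetical names, and the function returns a plain matrix, not a data frame. It computes the standard cosine similarity between every pair of columns, using `crossprod(X)` for all pairwise dot products and dividing by the outer product of the column norms.

```r
CosineSimFast <- function(X) {
  # pairwise cosine similarity between the columns of X, without loops
  X <- as.matrix(X)
  # Euclidean norm of each column
  norms <- sqrt(colSums(X^2))
  # t(X) %*% X holds every pairwise dot product; divide by the norm products
  crossprod(X) / outer(norms, norms)
}

# small random example (made-up data)
set.seed(1)
toy.X <- matrix(rnorm(20), nrow = 5,
                dimnames = list(NULL, c("a", "b", "c", "d")))
toy.sim <- CosineSimFast(toy.X)
round(toy.sim, 2)
```

The diagonal of the result is 1 (up to floating point error), and the matrix is symmetric, so it can be passed through the same `diag(...) <- NA` step used later.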
Below, we calculate the cosine similarity matrix for these 35 words.
# calculate the cosine similarity between the forty most common words
cosine.similarity <- CosineSim(t(GoogleNews.common[forty.most.commmon.words, ]))
Since the diagonal similarity values are all 1 (the similarity of a word with itself is 1), and this can skew the color scale, we make a point of setting these values to NA.
diag(cosine.similarity) <- NA
Using superheat to visualise the cosine similarity matrix for the 35 most common words, and adding dendrograms to aid comparisons, we paint a very clear (and pretty!) picture of the relationships between these words.
superheat(cosine.similarity,
          # place dendrograms on columns and rows
          row.dendrogram = TRUE,
          col.dendrogram = TRUE,
          # make gridlines white for enhanced prettiness
          grid.hline.col = "white",
          grid.vline.col = "white",
          # rotate bottom label text
          bottom.label.text.angle = 90,
          legend.breaks = c(-0.1, 0.1, 0.3, 0.5))