A speedy introduction to Word2Vec

To blatantly quote the Wikipedia article on Word2Vec:

Word2Vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2Vec takes as its input a large corpus of text and produces a high-dimensional space (typically of several hundred dimensions), with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

Basically, Word2Vec was developed by a bunch of clever Googlers as a way to represent words as high-dimensional vectors such that the relative positions of these vectors in space are meaningful in terms of linguistic context. Somehow, this embedding gives rise to algorithmic magic in which equations such as

king - queen = man - woman

and its logical extension,

king - queen + woman = man

make sense.
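
If you have the word vectors loaded into R, you can play this game yourself. The function below is a minimal sketch (not part of the analysis in this post), assuming a hypothetical numeric matrix word.vectors with one row per word and the words as row names; it returns the words whose vectors lie closest (in cosine similarity) to a - b + c.

# minimal sketch (assumes a matrix `word.vectors` with words as row names)
analogy <- function(word.vectors, a, b, c, n = 5) {
  # the target vector, e.g. king - queen + woman
  target <- word.vectors[a, ] - word.vectors[b, ] + word.vectors[c, ]
  # cosine similarity between the target and every word vector
  sims <- as.vector(word.vectors %*% target) /
    (sqrt(rowSums(word.vectors^2)) * sqrt(sum(target^2)))
  names(sims) <- rownames(word.vectors)
  # drop the query words and return the n closest matches
  head(sort(sims[!(names(sims) %in% c(a, b, c))], decreasing = TRUE), n)
}
# e.g. analogy(word.vectors, "king", "queen", "woman") should rank "man" highly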

Most techniques for visualising relationships between high-dimensional word vectors rely on dimensionality reduction, such as projection onto principal components, or t-SNE. Some of these visualisations are actually pretty cool.

This case study introduces an alternative approach to visualising the relationship between word vectors in R that does not rely on dimensionality reduction. Welcome to the beautiful world of superheatmaps!

Preparing the data

# load in some useful libraries
library(knitr)
library(dplyr)
library(reshape2)
library(cluster)
library(ggplot2)

The Google News Word2Vec data can be found here. A long time ago, I used the convert_Word2Vec.py Python script to convert it to a .csv file, and then read it into R and saved it as an .RData file. This script was adapted from Franck Dernoncourt’s response to this Stack Overflow post.
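
For reference, that one-off conversion step looked something like the sketch below (the file names here are assumptions; adjust them to wherever you saved the converted file):

# rough sketch of the one-off conversion step (file paths are assumptions)
GoogleNews <- read.csv("processed_data/GoogleNews.csv", stringsAsFactors = FALSE)
save(GoogleNews, file = "processed_data/GoogleNews.RData")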

Identifying the most common words from the NY Times headlines

First, we want to limit our data to some list of common words. Specifically, we will identify the most common words from the New York Times headlines compiled by Professor Amber E. Boydstun at UC Davis. The headlines data can be obtained from the RTextTools R package. Having identified these common words, we will then filter the GoogleNews Word2Vec dataset down to them.

library(RTextTools)
# The NYTimes dataset can be extracted from the RTextTools package
data(NYTimes)
# view the first 3 rows
kable(head(NYTimes, 3))
| Article_ID | Date     | Title                                                               | Subject                                    | Topic.Code |
|------------|----------|---------------------------------------------------------------------|--------------------------------------------|------------|
| 41246      | 1-Jan-96 | Nation’s Smaller Jails Struggle To Cope With Surge in Inmates       | Jails overwhelmed with hardened criminals  | 12         |
| 41257      | 2-Jan-96 | FEDERAL IMPASSE SADDLING STATES WITH INDECISION                      | Federal budget impasse affect on states    | 20         |
| 41268      | 3-Jan-96 | Long, Costly Prelude Does Little To Alter Plot of Presidential Race  | Contenders for 1996 Presedential elections | 20         |

In preparation for calculating word frequencies, we first restrict to the headlines, and will then do some pre-processing using the tm package.

library(tm)
# Extract the title from each article entry
nyt.headlines <- as.character(NYTimes$Title)
# convert the headline titles to a tm corpus
nyt.corpus <- Corpus(VectorSource(nyt.headlines))

The pre-processing steps involve removing stop-words like “a”, “and”, and “the”, converting all words to lowercase, and removing punctuation, numbers and extra whitespace.

# convert to lowercase
nyt.corpus <- tm_map(nyt.corpus, content_transformer(tolower))
# remove punctuation
nyt.corpus <- tm_map(nyt.corpus, removePunctuation)
# remove numbers
nyt.corpus <- tm_map(nyt.corpus, removeNumbers)
# remove whitespace
nyt.corpus <- tm_map(nyt.corpus, stripWhitespace)
# remove stop words
nyt.corpus <- tm_map(nyt.corpus, removeWords, stopwords("english"))

Next, to get the word counts, we add up the columns of the document-term matrix.

# convert to a dtm
dtm <- DocumentTermMatrix(nyt.corpus)
# calculate the word counts
freq <- colSums(as.matrix(dtm))   
freq.df <- data.frame(word = names(freq), count = freq) %>%
  arrange(desc(count))
# view the 10 most frequent words
kable(head(freq.df, 10))
| word      | count |
|-----------|-------|
| new       | 205   |
| bush      | 115   |
| iraq      | 98    |
| war       | 81    |
| overview  | 80    |
| nation    | 67    |
| campaign  | 65    |
| president | 61    |
| plan      | 59    |
| report    | 59    |

Apparently the articles in the dataset talk a lot about Bush, the Iraq war and the presidential campaigns!

Next, filtering to the 1,000 most common words:

# obtain the 1000 most common words in the New York Times headlines
common.words <- names(sort(freq, decreasing = TRUE))[1:1000]

Obtaining and cleaning the Word2Vec Google News data

Next, we will load in the previously processed GoogleNews Word2Vec data, and filter to the most common words.

load("processed_data/GoogleNews.RData")
GoogleNews.common <- GoogleNews[tolower(GoogleNews[,1]) %in% common.words,]

Next we want to remove the first column corresponding to the words and place these in the row labels.

# the first column contains the words so we want to set the row names accordingly
rownames(GoogleNews.common) <- GoogleNews.common[,1]
# and then remove the first column
GoogleNews.common <- GoogleNews.common[,-1]

Note that many words are repeated; for example, the word “next” has 7 different vectors corresponding to all of its capitalization variants: “next”, “NEXT”, “neXt”, “NExt”, “NeXT”, “NeXt”, and “Next”.
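
For example, you can list the capitalization variants of a word directly from the row names (at this point GoogleNews.common still contains all of the variants):

# list all capitalization variants of "next" in the common-word data
rownames(GoogleNews.common)[tolower(rownames(GoogleNews.common)) == "next"]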

We thus make the simplifying choice to remove all non-lowercase word vectors from our dataset.

# identify all lowercase words
lowercase.words <- rownames(GoogleNews.common) == tolower(rownames(GoogleNews.common))
# restrict to these lowercase words
GoogleNews.common <- GoogleNews.common[lowercase.words, ]

Introducing Superheat

Installing the superheat package from GitHub is easy using the devtools package. Simply type the following commands:

# install devtools if you don't have it already
install.packages("devtools")
# install the development version of superheat
devtools::install_github("rlbarter/superheat")

Assuming that you didn’t run into any unfortunate errors when installing the package, you can load the package in the normal way.

library(superheat)

Visualising cosine similarity for the 40 most common words

Direct visualisation of the raw word vectors themselves is quite uninformative, primarily because the individual Word2Vec dimensions have no inherent meaning. Instead, we will visually compare the vectors using cosine similarity, a common similarity metric for Word2Vec data.
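
As a reminder, the cosine similarity between two word vectors \(x\) and \(y\) is

\[s_{\text{cosine}}(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \sqrt{\sum_i y_i^2}},\]

which is equal to 1 when the vectors point in the same direction, 0 when they are orthogonal, and -1 when they point in opposite directions.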

First, we restrict to the 40 most common words from the NY Times headlines that are also in the Google News corpus, leaving us with 35 words, since there were 5 words from the NY Times whose lowercase versions don’t appear in the Google News corpus: “nation”, “clinton”, “bill”, “back”, and “set”.

# identify the 40 most common words from the NY Times headlines that are also in the Google News corpus
forty.most.common.words <- common.words[1:40][common.words[1:40] %in% rownames(GoogleNews.common)]
length(forty.most.common.words)
## [1] 35

Next we need to write a function for computing the pairwise cosine similarity between the columns of a matrix. Although the function uses for-loops, it’s actually pretty quick.

CosineFun <- function(x, y){
  # calculate the cosine similarity between two vectors: x and y
  c <- sum(x*y) / (sqrt(sum(x * x)) * sqrt(sum(y * y)))
  return(c)
}

CosineSim <- function(X) {
  # calculate the pairwise cosine similarity between columns of the matrix X.
  # initialize similarity matrix
  m <- matrix(NA, 
              nrow = ncol(X),
              ncol = ncol(X),
              dimnames = list(colnames(X), colnames(X)))
  cos <- as.data.frame(m)
  
  # calculate the pairwise cosine similarity
  for(i in 1:ncol(X)) {
    for(j in i:ncol(X)) {
      # cosine similarity between columns i and j
      cos[i, j] <- CosineFun(X[, i], X[, j])
      # fill in the symmetric entry
      cos[j, i] <- cos[i, j]
    }
  }
  return(cos)
}
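
As an aside, for dense vectors like these the same matrix can be computed in one shot with matrix algebra. The sketch below is an alternative (not the version used in this post) that produces the same similarity matrix up to floating-point error:

# alternative, vectorised sketch of the same computation (not used below)
CosineSimFast <- function(X) {
  # all pairwise dot products between the columns of X
  dot.products <- crossprod(X)
  # the column norms are the square roots of the diagonal
  norms <- sqrt(diag(dot.products))
  # divide each dot product by the product of the corresponding norms
  as.data.frame(dot.products / outer(norms, norms))
}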

Below, we calculate the cosine similarity matrix for these 35 words.

# calculate the cosine similarity between the forty most common words
cosine.similarity <- CosineSim(t(GoogleNews.common[forty.most.common.words, ]))

Since the diagonal values are all equal to 1 (each word is perfectly similar to itself), which can skew the color scale, we set these entries to NA.

diag(cosine.similarity) <- NA

Using superheat to visualise the cosine similarity matrix between the set of the 35 most common words, and adding dendrograms to aid comparisons, we paint a very clear (and pretty!) picture of the relationships between these words.

superheat(cosine.similarity, 

          # place dendrograms on columns and rows 
          row.dendrogram = T, 
          col.dendrogram = T,
          
          # make gridlines white for enhanced prettiness
          grid.hline.col = "white",
          grid.vline.col = "white",
          
          # rotate bottom label text
          bottom.label.text.angle = 90,
          
          legend.breaks = c(-0.1, 0.1, 0.3, 0.5))

For example, we can see that “vote” and “senate”, “war” and “iraq”, as well as “court” and “case” have similar word vectors. In addition, there appear to be three primary groups of words: the bottom-left corner shows that words concerning the political campaign and the Iraq war are very similar, whereas the center block shows that words such as “may” and “will” appear in similar contexts.

Visualising word clusters for the 1000 most common words

In order to visualise more than 30 or so words at once, we can cluster words into meaningful groups.

Below we calculate the cosine similarity matrix for the 1,000 most common words from the NY Times headlines. Note that, again, we are actually left with only 858 words since the lowercase versions of some common words from the NY Times headlines didn’t appear in the Google News corpus.

# calculate the cosine similarity between the 1,000 most common words
cosine.similarity.full <- CosineSim(t(GoogleNews.common))
dim(cosine.similarity.full)
## [1] 858 858

We now want to cluster this cosine similarity matrix.

In order to select the number of clusters, we will choose the number that not only effectively captures distinct groupings, but also has the highest cluster stability (i.e. the same clusters are generated when using sub-samples of the data).

Choosing the number of clusters

We use two criteria to select the number of clusters, \(k\):

  1. performance-based (cosine silhouette width): the average (cosine) silhouette width for each value of \(k\).

  2. stability-based (Jaccard similarity): the average pairwise Jaccard similarity between clusters based on 90% sub-samples of the data.

Cosine silhouette width

For each value of \(k\), we can calculate the average silhouette width. The silhouette width is a traditional measure of cluster quality based on how well each object lies within its cluster. We adapted the definition to suit cosine-based distance, so that the cosine silhouette width for data point \(i\) is defined to be:

\[s^{\text{cosine}}(i) = b(i) - a(i)\]

where

\[a(i) = \frac{1}{|C_i|} \sum_{j \in C_i} d_{\text{cosine}}(x_i, x_j)\]

is the average cosine-dissimilarity of \(i\) with all other data within the same cluster, and

\[b(i) = \min_{C \neq C_i} d_{\text{cosine}}(x_i, C)\]

is the lowest average dissimilarity of \(i\) to any other cluster of which \(i\) is not a member. Here \(C_i\) is the index set of the cluster to which \(i\) belongs, and \(d_{\text{cosine}}(x, y)\) is a measure of cosine “distance”, which is equal to

\[d_{\text{cosine}}(x, y) = \frac{\cos^{-1}\left(s_{\text{cosine}}(x, y)\right)}{\pi}\]

(where \(s_{\text{cosine}}(x, y)\) is the standard cosine similarity).
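
As a quick sanity check, this maps a similarity of 1 (identical directions) to a distance of 0, a similarity of 0 (orthogonal vectors) to 0.5, and a similarity of -1 (opposite directions) to 1:

# cosine "distance" corresponding to cosine similarities of 1, 0 and -1
acos(c(1, 0, -1)) / pi
## [1] 0.0 0.5 1.0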

The function below calculates the silhouette width for every data point provided in the cosine.matrix argument based on the clusters provided by the membership argument.

# calculate the cosine silhouette width, which in cosine land is 
# (1) the lowest average dissimilarity of the data point to any other cluster, 
#  minus
# (2) the average dissimilarity of the data point to all other data points in 
#     the same cluster
cosineSilhouette <- function(cosine.matrix, membership) {
  # Args:
  #   cosine.matrix: the cosine similarity matrix for the words
  #   membership: the named membership vector for the rows and columns. 
  #               The entries should be cluster centers and the vector 
  #               names should be the words.
  if (!is.factor(membership)) {
    stop("membership must be a factor")
  }
  # note that there are some floating point issues:
  # (some "1" entires are actually sliiightly larger than 1)
  cosine.dissim <- acos(round(cosine.matrix, 10)) / pi
  widths.list <- lapply(levels(membership), function(clust) {
    # filter rows of the similarity matrix to words in the current cluster
    # filter cols of the similarity matrix to words in the current cluster
    cosine.matrix.inside <- cosine.dissim[membership == clust, 
                                          membership == clust]
    # a: average dissimilarity of i with all other data in the same cluster
    a <- apply(cosine.matrix.inside, 1, mean)
    # filter rows of the similarity matrix to words in the current cluster
    # filter cols of the similarity matrix to words NOT in the current cluster
    other.clusters <- levels(membership)[levels(membership) != clust]
    cosine.matrix.outside <- sapply(other.clusters, function(other.clust) {
      cosine.dissim[membership == clust, membership == other.clust] %>%
        apply(1, mean) # average dissimilarity to the words in the other cluster
    })
    # b is the lowest average dissimilarity of i to any other cluster of 
    # which i is not a member
    b <- apply(cosine.matrix.outside, 1, min)
    # silhouette width is b - a
    cosine.sil.width <- b - a
    data.frame(word = names(cosine.sil.width), width = cosine.sil.width)
  })
  widths.list <- do.call(rbind, widths.list)
  # join membership onto data.frame
  membership.df <- data.frame(word = names(membership), 
                              membership = membership)
  widths.list <- left_join(widths.list, membership.df, by = "word")
  return(widths.list)
}

Using the cosineSilhouette() function to calculate the average cosine silhouette width for each \(k\), we can plot the number of clusters, \(k\), against the average cosine silhouette width across all observations.

set.seed(238942)
# calculate the average silhouette width for k=5, ..., 20
sil.width <- sapply(5:20, function(k) {
  # generate k clusters
  membership <- pam(cosine.similarity.full, k = k)
  # calculate the silhouette width for each observation
  width <- cosineSilhouette(cosine.similarity.full, 
                   membership = factor(membership$clustering))$width
  return(mean(width))
})
# plot k versus silhouette width
data.frame(k = 5:20, width = sil.width) %>%
  ggplot(aes(x = k, y = width)) +
  geom_line() + 
  geom_point() +
  scale_y_continuous(name = "Average silhouette width")

It is fairly obvious from this plot that the best \(k\) is \(k = 11\).

Jaccard Similarity

Next, for each set of cluster membership pairs (where each membership vector is calculated based on a 90% sub-sample of the data), we want to calculate the Jaccard similarity of the membership vectors.

Since each membership iteration corresponds to a 90% sub-sample, we ignore words that are missing from either of the iterations.

The generateClusters() function generates k clusters (where we range k over some set of values such as 5 to 20). We do this N times, each time taking a subset of 90% of the data.

Based on these N iterations of clusters, we will evaluate both performance and stability and select a number of clusters based on these criteria.

library(cluster)

# perform clustering for k in k.range clusters over N 90% sub-samples 
generateClusters <- function(similarity.mat, k.range, N) {
  random.subset.list <- lapply(1:N, function(i) {
    sample(1:nrow(similarity.mat), 0.9 * nrow(similarity.mat))
    })
  lapply(k.range, function(k) {
    print(paste("k =", k))
    lapply(1:N, function(i) {
      # randomly sample 90% of words
      cosine.sample <- similarity.mat[random.subset.list[[i]], random.subset.list[[i]]]
      # perform PAM clustering on the cosine dissimilarity (1 - similarity)
      pam(1 - cosine.sample, k = k, diss = TRUE)
    })
  })
}

We decide to test the range of \(k = 5, ..., 20\) clusters and repeat each of these clusterings across 100 different 90% sub-samples.

# generate clusters ranging from 5 to 20 cluster groups for each of 100 subsamples
# This will take a little while to run
cluster.iterations <- generateClusters(cosine.similarity.full, 
                                       k.range = 5:20, 
                                       N = 100)

We next need to clean the results into a nice format. The outer list of join.cluster.iterations below corresponds to each value of \(k\). Each list entry is a data frame in which the first column contains the words and the remaining columns contain the cluster memberships from each of the 100 sub-samples.

# clean the simulation structure
join.cluster.iterations <- lapply(cluster.iterations, function(list) {
  # for each list of iterations (for a specific k), 
  # full-join the membership vectors into a data frame 
  # (there will be missing values in each column)
  Reduce(function(x, y) full_join(x, y, by = "words"), 
    lapply(list, function(cluster.obj) {
      df <- data.frame(words = names(cluster.obj$clustering), 
                 clusters = cluster.obj$clustering)
      }))
  })
# clean column names 
join.cluster.iterations <- lapply(join.cluster.iterations, function(x) {
  colnames(x) <- c("words", paste0("membership", 1:100))
  return(x)
  })

Below we print the first six entries of the membership vectors for 7 of the 100 iterations when \(k = 5\). Notice that there are NA values: these correspond to words that were omitted from the 90% subsample. There are actually 100 membership columns in the full data frame, each corresponding to an iteration of PAM with \(k=5\) on a 90% subsample.

# view the first 8 columns of the first data frame (corresponding to k=5)
kable(head(join.cluster.iterations[[1]][, 1:8]))
| words    | membership1 | membership2 | membership3 | membership4 | membership5 | membership6 | membership7 |
|----------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
| force    | 1           | 1           | NA          | 2           | 1           | 1           | 1           |
| issues   | 2           | 5           | NA          | 1           | 4           | 4           | NA          |
| newark   | 3           | 3           | 1           | 5           | 3           | 2           | 2           |
| justices | 1           | 5           | 5           | 2           | 1           | 1           | 1           |
| iraqi    | 3           | 3           | 1           | 5           | 3           | 2           | 2           |
| stock    | 1           | 2           | 4           | NA          | 1           | 1           | 1           |

Next, for each pair of these cluster iterations, we can calculate the Jaccard similarity. To speed things up, we wrote a Jaccard function in C++. Note also that, to avoid correlations, we take independent pairs: e.g. we calculate the Jaccard similarity between the membership vectors from iterations 1 and 2, then from 3 and 4, then from 5 and 6, etc. This means that for each value of \(k\) we calculate 50 Jaccard similarity values.
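
The C++ helper itself isn’t shown in this post, but the quantity it computes can be sketched in plain R. The version below is an assumption about what RcppSimilarity calculates (the C++ implementation simply does it much faster): treat every pair of words as either co-clustered or not in each clustering, and take the Jaccard index of the two sets of co-clustered pairs.

# plain-R sketch of the (assumed) Jaccard computation performed by RcppSimilarity
jaccardSimilarity <- function(membership1, membership2) {
  # logical matrices: is each pair of words placed in the same cluster?
  same1 <- outer(membership1, membership1, "==")
  same2 <- outer(membership2, membership2, "==")
  # restrict to distinct pairs of words (upper triangle)
  pairs <- upper.tri(same1)
  # pairs co-clustered in both clusterings / pairs co-clustered in either
  sum(same1[pairs] & same2[pairs]) / sum(same1[pairs] | same2[pairs])
}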

# calculate the pairwise jaccard similarity between each of the cluster 
# memberships across the common words
# to avoid correlation, we do this pairwise between simulations 1 and 2, 
# and then between simulations 3 and 4, and so on
library(Rcpp)
library(reshape2)
# use Rcpp to speed up the computation
sourceCpp('code/Rcpp_similarity.cpp')
jaccard.similarity <- sapply(join.cluster.iterations, 
       function(cluster.iteration) {
        sapply(seq(2, ncol(cluster.iteration) - 1, by = 2), 
             function(i) {
               # calculate the Jaccard similarity between each pair of columns
               cluster.iteration.pair <- cluster.iteration[ , c(i, i + 1)]
               colnames(cluster.iteration.pair) <- c("cluster1", "cluster2")
               # remove words that do not appear in both 90% sub-samples
               cluster.iteration.pair <- cluster.iteration.pair %>%
                 filter(!is.na(cluster1), !is.na(cluster2))
               # Calculate the Jaccard similarity between the two cluster vectors
               RcppSimilarity(cluster.iteration.pair[ , 1], 
                              cluster.iteration.pair[ , 2])
             })
  })

We next want to melt jaccard.similarity into a long-form data frame that is easy to manipulate for visualisation.

# melt the similarity matrix into a long-format data frame
jaccard.similarity.long <- melt(jaccard.similarity)
colnames(jaccard.similarity.long) <- c("iter", "k", "similarity")
# recover the number of clusters, k (the columns correspond to k = 5, ..., 20)
jaccard.similarity.long$k <- jaccard.similarity.long$k + 4
jaccard.similarity.long <- jaccard.similarity.long %>% 
  filter(k <= 20)
# average over iterations
jaccard.similarity.avg <- jaccard.similarity.long %>% 
  group_by(k) %>% 
  summarise(similarity = mean(similarity))

Plotting the Jaccard similarity for each \(k\), we find that as \(k\) increases, the variability decreases, but so too does the similarity!

# plot number of clusters versus Jaccard similarity
ggplot(jaccard.similarity.long) + 
  geom_boxplot(aes(x = k, y = similarity, group = k)) +
  geom_line(aes(x = k, y = similarity), 
            linetype = "dashed",
            data = jaccard.similarity.avg) +
  ggtitle("Jaccard similarity versus k")

It seems as though for values of \(k\) from 8 to 20 there is little difference in the Jaccard similarity (the average Jaccard similarity is close to 0.5), so we maintain that \(k=11\) is a suitable choice.

Clustering with 11 clusters

Having decided that \(k=11\) is an appropriate number of clusters, we use PAM on the full cosine distance matrix (when clustering, it is easier to provide a dissimilarity matrix than a similarity matrix). Later we will compare this with the clustering we obtain when we set \(k=12\), so we compute that too.

# note that there are some floating point issues in the similarity matrix:
# some "1" entries are actually sliiightly larger than 1, so we round to 
# the nearest 10 dp when calculating the distance matrix
word.clusters <- pam(acos(round(cosine.similarity.full, 10)) / pi, k = 11, diss = TRUE)
word.clusters.12 <- pam(acos(round(cosine.similarity.full, 10)) / pi, k = 12, diss = TRUE)

We then define the cluster labels to be the medoid (center) word for that cluster (recall that PAM is like k-means but requires that the center of the cluster is a data point).

# print the cluster medoids
word.clusters$medoids
##  [1] "just"       "struggle"   "american"   "north"      "government"
##  [6] "pushes"     "give"       "murder"     "proposal"   "bombing"   
## [11] "companies"
# convert the membership vector to a factor
word.membership <- factor(word.clusters$clustering)

# print the cluster medoids
word.clusters.12$medoids
##  [1] "just"       "struggle"   "american"   "north"      "government"
##  [6] "pushes"     "children"   "give"       "murder"     "proposal"  
## [11] "bombing"    "companies"
# convert the membership vector to a factor
word.membership.12 <- factor(word.clusters.12$clustering)
# replace integer membership by medoid membership
levels(word.membership) <- word.clusters$medoids
# replace integer membership by medoid membership
levels(word.membership.12) <- word.clusters.12$medoids

Next, we compare the clusterings obtained with \(k = 11\) and \(k = 12\). For the most part, we are curious about which clusters would be forced to split into two or more clusters. Below we plot a superheatmap displaying, for each cluster from the \(k = 11\) solution, the proportion of its words that fall into each cluster from the \(k = 12\) solution.

# compare the membership vectors with 11 and 12 clusters
word.membership.split <- split(word.membership, word.membership)
word.membership.split.12 <- split(word.membership.12, word.membership.12)
compare.11.12 <- sapply(word.membership.split, function(i) {
  sapply(word.membership.split.12, function(j) {
    sum(names(i) %in% names(j)) / length(i)
  })
})
superheat(compare.11.12, 
          heat.pal = c("white", "grey", "black"),
          heat.pal.values = c(0, 0.1, 1),
          column.title = "11 clusters",
          row.title = "12 clusters",
          bottom.label.text.angle = 90,
          bottom.label.size = 0.4)

Plotting a clustered superheatmap with silhouette plot

Next, we can calculate the cosine silhouette width for each word. We will plot this above the columns of our superheatmap.

# calculate the cosine silhouette width
cosine.silhouette <- 
  cosineSilhouette(cosine.similarity.full, word.membership)
# arrange the words in the same order as the original matrix
rownames(cosine.silhouette) <- cosine.silhouette$word
cosine.silhouette <- cosine.silhouette[rownames(cosine.similarity.full), ]

Next, we want to order the clusters by average silhouette width.

# calculate the average width for each cluster
avg.sil.width <- cosine.silhouette %>% 
  group_by(membership) %>% 
  summarise(avg.width = mean(width)) %>% 
  arrange(avg.width)
# add a blank space after each word (for aesthetic purposes)
word.membership.padded <- paste0(word.membership, " ")
# reorder the cluster levels by increasing average silhouette width
word.membership.padded <- factor(word.membership.padded, 
                          levels = paste0(avg.sil.width$membership, " "))

We are now ready to plot a clustered superheatmap.

superheat(cosine.similarity.full,
          
          # row and column clustering
          membership.rows = word.membership.padded,
          membership.cols = word.membership.padded,
          
          # top plot: silhouette
          yt = cosine.silhouette$width,
          yt.axis.name = "Cosine\nsilhouette\nwidth",
          yt.plot.type = "bar",
          yt.bar.col = "grey35",
          
          # order of rows and columns within clusters
          order.rows = order(cosine.silhouette$width),
          order.cols = order(cosine.silhouette$width),
          
          # bottom labels
          bottom.label.col = c("grey95", "grey80"),
          bottom.label.text.angle = 90,
          bottom.label.text.alignment = "right",
          bottom.label.size = 0.28,
          
          # left labels
          left.label.col = c("grey95", "grey80"),
          left.label.text.alignment = "right",
          left.label.size = 0.26,
          
          # title
          title = "(a)")

Due to the amount of information it contains, the heatmap above appears very grainy, so we decide to smooth within each cluster (replacing each within-cluster block by the median of its cells).

superheat(cosine.similarity.full, 
          
          # row and column clustering
          membership.rows = word.membership.padded,
          membership.cols = word.membership.padded,
          
          # top plot: silhouette
          yt = cosine.silhouette$width,
          yt.axis.name = "Cosine\nsilhouette\nwidth",
          yt.plot.type = "bar",
          yt.bar.col = "grey35",
          
          # order of rows and columns within clusters
          order.rows = order(cosine.silhouette$width),
          order.cols = order(cosine.silhouette$width),
          
          # bottom labels
          bottom.label.col = c("grey95", "grey80"),
          bottom.label.text.angle = 90,
          bottom.label.text.alignment = "right",
          bottom.label.size = 0.28,
          
          # left labels
          left.label.col = c("grey95", "grey80"),
          left.label.text.alignment = "right",
          left.label.size = 0.26,
          
          # smooth heatmap within clusters
          smooth.heat = T,
          
          # title
          title = "(b)")

Cluster word clouds

Lastly, we produce some word clouds to identify the members of each cluster. The function below produces a word cloud for a specific word cluster.

library(RColorBrewer)
library(wordcloud)
# define a function that takes the cluster name and the membership vector 
# and returns a word cloud
makeWordCloud <- function(cluster, word.membership, words.freq) {
  words <- names(word.membership[word.membership == cluster])
  words.freq <- words.freq[words]
  # make all words black except for the cluster center
  words.col <- rep("black", length = length(words.freq))
  words.col[words == cluster] <- "red"
  # the size of the words will be the frequency from the NY Times headlines
  wordcloud(words, words.freq, colors = words.col, 
            ordered.colors = TRUE, random.order = FALSE, max.words = 80)
}

In each word cloud, the cluster center is in red, and the size corresponds to the word frequency from the NY Times headlines. It is thus interesting to notice that oftentimes the cluster center has fairly low frequency.

# plot word clouds
set.seed(52545)
for (word in levels(word.membership)) {
  makeWordCloud(word, word.membership, words.freq = freq)
}