To blatantly quote the Wikipedia article on Word2Vec:
Word2Vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2Vec takes as its input a large corpus of text and produces a high-dimensional space (typically of several hundred dimensions), with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space
Basically, Word2Vec was developed by a bunch of clever Googlers as a way to represent words as high-dimensional vectors such that the relative positions of these vectors in space are meaningful in terms of linguistic context. Somehow, this embedding gives rise to algorithmic magic in which equations such as
king - queen = man - woman
and its logical extension,
king - queen + woman = man
make sense.
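As a quick illustration of what this arithmetic looks like in practice, the sketch below checks the "king - queen + woman = man" analogy using cosine similarity. This is only a sketch: the word.vectors object is a hypothetical numeric matrix of word vectors with the words as row names, not an object created anywhere in this post.
# a minimal sketch (not run): check that king - queen + woman lands near man
# `word.vectors` is an assumed numeric matrix with words as row names
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
analogy <- word.vectors["king", ] - word.vectors["queen", ] + word.vectors["woman", ]
# cosine similarity between the analogy vector and every word vector
sims <- apply(word.vectors, 1, cosine, y = analogy)
# the top-ranked words should (hopefully) include "man"
head(sort(sims, decreasing = TRUE))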
Most techniques for visualising relationships between high-dimensional word vectors rely on dimensionality reduction techniques, such as projection onto principal components, or using t-SNE. Some of these visualisations are actually pretty cool.
This case study introduces an alternative approach to visualising the relationship between word vectors in R that does not rely on dimensionality reduction. Welcome to the beautiful world of superheatmaps!
# load in some useful libraries
library(knitr)
library(dplyr)
library(reshape2)
library(cluster)
library(ggplot2)
The Google News Word2Vec data can be found here. A long time ago, I used the convert_Word2Vec.py
Python script to convert it to a .csv, which I then read into R and saved as an .RData file. This script was adapted from Franck Dernoncourt’s response to this Stack Overflow post.
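For reference, the R side of that conversion looked something like the sketch below. The file name and the exact CSV layout are assumptions here (the word in the first column, the vector entries in the remaining columns); adjust to whatever the script actually produces.
# read in the .csv produced by convert_Word2Vec.py
# (assumed layout: word in column 1, vector entries in the remaining columns)
GoogleNews <- read.csv("processed_data/GoogleNews.csv",
                       header = FALSE, stringsAsFactors = FALSE)
# save as an .RData file so it can be loaded quickly later on
save(GoogleNews, file = "processed_data/GoogleNews.RData")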
First, we want to limit our data to a list of common words. Specifically, we will identify the most common words from the New York Times headlines compiled by Professor Amber E. Boydstun at UC Davis. The headlines data can be obtained from the RTextTools
R package. Having identified these common words, we will then filter the GoogleNews Word2Vec dataset down to them.
library(RTextTools)
# The NYTimes dataset can be extracted from the RTextTools package
data(NYTimes)
# view the first 3 rows
kable(head(NYTimes, 3))
Article_ID | Date | Title | Subject | Topic.Code |
---|---|---|---|---|
41246 | 1-Jan-96 | Nation’s Smaller Jails Struggle To Cope With Surge in Inmates | Jails overwhelmed with hardened criminals | 12 |
41257 | 2-Jan-96 | FEDERAL IMPASSE SADDLING STATES WITH INDECISION | Federal budget impasse affect on states | 20 |
41268 | 3-Jan-96 | Long, Costly Prelude Does Little To Alter Plot of Presidential Race | Contenders for 1996 Presedential elections | 20 |
In preparation for calculating word frequencies, we first restrict to the headlines, and will then do some pre-processing using the tm
package.
library(tm)
# Extract the title from each article entry
nyt.headlines <- as.character(NYTimes$Title)
# convert the headline titles to a tm corpus
nyt.corpus <- Corpus(VectorSource(nyt.headlines))
The pre-processing steps involve removing stop-words like “a”, “and”, and “the”, converting all words to lowercase, and removing punctuation, numbers, and extra whitespace.
# convert to lowercase
nyt.corpus <- tm_map(nyt.corpus, content_transformer(tolower))
# remove punctuation
nyt.corpus <- tm_map(nyt.corpus, removePunctuation)
# remove numbers
nyt.corpus <- tm_map(nyt.corpus, removeNumbers)
# remove whitespace
nyt.corpus <- tm_map(nyt.corpus, stripWhitespace)
# remove stop words
nyt.corpus <- tm_map(nyt.corpus, removeWords, stopwords("english"))
Next, to get the word counts, we can add up the columns of the document-term matrix.
# convert to a dtm
dtm <- DocumentTermMatrix(nyt.corpus)
# calculate the word counts
freq <- colSums(as.matrix(dtm))
freq.df <- data.frame(word = names(freq), count = freq) %>%
arrange(desc(count))
# view the 10 most frequent words
kable(head(freq.df, 10))
word | count |
---|---|
new | 205 |
bush | 115 |
iraq | 98 |
war | 81 |
overview | 80 |
nation | 67 |
campaign | 65 |
president | 61 |
plan | 59 |
report | 59 |
Apparently the articles in the dataset talk a lot about Bush, the Iraq war and the presidential campaigns!
Next, filtering to the 1,000 most common words:
# obtain the 1000 most common words in the New York Times headlines
common.words <- names(sort(freq, decreasing = TRUE))[1:1000]
Next, we will load in the previously processed GoogleNews Word2Vec data, and filter to the most common words.
load("processed_data/GoogleNews.RData")
GoogleNews.common <- GoogleNews[tolower(GoogleNews[,1]) %in% common.words,]
Next, we want to move the words in the first column into the row names and then remove that column.
# the first column contains the words so we want to set the row names accordingly
rownames(GoogleNews.common) <- GoogleNews.common[,1]
# and then remove the first column
GoogleNews.common <- GoogleNews.common[,-1]
Note that many words are repeated: for example, the word “next” has 7 different vectors corresponding to all possible capitalization combinations that appear in the corpus: “next”, “NEXT”, “neXt”, “NExt”, “NeXT”, “NeXt”, and “Next”.
We thus make the simplifying choice to remove all non-lowercase word vectors from our dataset.
# identify all lowercase words
lowercase.words <- rownames(GoogleNews.common) == tolower(rownames(GoogleNews.common))
# restrict to these lowercase words
GoogleNews.common <- GoogleNews.common[lowercase.words, ]
Installing the superheat package from GitHub is easy using the devtools
package. Simply type the following commands:
# install devtools if you don't have it already
install.packages("devtools")
# install the development version of superheat
devtools::install_github("rlbarter/superheat")
Assuming that you didn’t run into any unfortunate errors when installing the package, you can load the package in the normal way.
library(superheat)
Direct visualisation of the raw word vectors themselves is quite uninformative, primarily because the individual Word2Vec dimensions are somewhat meaningless on their own. Instead, we will visually compare the vectors using cosine similarity, a common similarity metric for Word2Vec data.
First, we restrict to the 40 most common words from the NY Times headlines that are also in the Google News corpus. This leaves us with 35 words, since there are 5 words from the NY Times headlines whose lowercase versions don’t appear in the Google News corpus: “nation”, “clinton”, “bill”, “back”, and “set”.
# identify the 40 most common words from the NY Times headlines that are also in the Google News corpus
forty.most.common.words <- common.words[1:40][common.words[1:40] %in% rownames(GoogleNews.common)]
length(forty.most.common.words)
## [1] 35
Next we need to write a function for computing the pairwise cosine similarity between entries in a matrix. Although the function uses for-loops, it’s actually pretty quick.
CosineFun <- function(x, y){
# calculate the cosine similarity between two vectors: x and y
c <- sum(x*y) / (sqrt(sum(x * x)) * sqrt(sum(y * y)))
return(c)
}
CosineSim <- function(X) {
# calculate the pairwise cosine similarity between columns of the matrix X.
# initialize similarity matrix
m <- matrix(NA,
nrow = ncol(X),
ncol = ncol(X),
dimnames = list(colnames(X), colnames(X)))
cos <- as.data.frame(m)
# calculate the pairwise cosine similarity
for(i in 1:ncol(X)) {
for(j in i:ncol(X)) {
# restrict to the dimensions where both vectors have non-zero entries
co_rate_1 <- X[which(X[, i] & X[, j]), i]
co_rate_2 <- X[which(X[, i] & X[, j]), j]
cos[i, j] <- CosineFun(co_rate_1, co_rate_2)
# fill in the opposite diagonal entry
cos[j, i] <- cos[i, j]
}
}
return(cos)
}
Below, we calculate the cosine similarity matrix for these 35 words.
# calculate the cosine similarity between the 35 most common words
cosine.similarity <- CosineSim(t(GoogleNews.common[forty.most.common.words, ]))
Since the diagonal similarity values are all 1 (the similarity of a word with itself is 1), and this can skew the color scale, we make a point of setting these values to NA.
diag(cosine.similarity) <- NA
Using superheat to visualise the cosine similarity matrix between the 35 most common words, and adding dendrograms to aid comparison, we paint a very clear (and pretty!) picture of the relationships between these words.
superheat(cosine.similarity,
# place dendrograms on columns and rows
row.dendrogram = T,
col.dendrogram = T,
# make gridlines white for enhanced prettiness
grid.hline.col = "white",
grid.vline.col = "white",
# rotate bottom label text
bottom.label.text.angle = 90,
legend.breaks = c(-0.1, 0.1, 0.3, 0.5))
For example, we can see that “vote” and “senate”, “war” and “iraq”, as well as “court” and “case” have similar word vectors. In addition, there appear to be three primary groups of words: the bottom-left corner shows that words concerning the political campaign and the Iraq war are very similar, whereas the center block shows that words such as “may” and “will” appear in similar contexts.
In order to visualise more than 30 or so words at once, we can cluster words into meaningful groups.
Below we define a new cosine similarity matrix corresponding to the 1,000 most common words from the NY Times headlines. Note that, again, we actually end up with fewer words (858, as the output below shows), since the lowercase versions of some common words from the NY Times headlines don’t appear in the Google News corpus.
# calculate the cosine similarity between the 1,000 most common words
cosine.similarity.full <- CosineSim(t(GoogleNews.common))
dim(cosine.similarity.full)
## [1] 858 858
We now want to cluster this cosine similarity matrix.
In order to select the number of clusters, we will choose the number that not only effectively captures distinct groupings, but also has the highest cluster stability (i.e. the same clusters are generated when using sub-samples of the data).
We use two criteria to select the number of clusters, \(k\):
1. performance-based (cosine silhouette width): the average (cosine) silhouette width for each value of \(k\).
2. stability-based (Jaccard similarity): the average pairwise Jaccard similarity between clusterings based on 90% sub-samples of the data.
For each value of \(k\), we can calculate the average silhouette width over all words. The silhouette width is a traditional measure of cluster quality based on how well each object lies within its cluster. We adapted the definition to suit cosine-based distance, so that the cosine-silhouette width for data point \(i\) is defined to be:
\[s^{\text{cosine}}(i) = b(i) - a(i)\]
where
\[a(i) = \frac{1}{|C_i|} \sum_{j \in C_i} d_{\text{cosine}}(x_i, x_j)\]
is the average cosine-dissimilarity of \(i\) with all other data within the same cluster, and
\[b(i) = \min_{C \neq C_i} d_{\text{cosine}}(x_i, C)\]
is the lowest average dissimilarity of \(i\) to any other cluster of which \(i\) is not a member. Here \(C_i\) is the index set of the cluster to which \(i\) belongs, and \(d_{\text{cosine}}(x, y)\) is a measure of cosine “distance”, which is equal to
\[d_{\text{cosine}} = \frac{\cos^{-1}(s_{\text{cosine}})}{\pi}\]
(where \(s_{\text{cosine}}\) is standard cosine similarity).
The function below calculates the silhouette width for every data point provided in the cosine.matrix
argument based on the clusters provided by the membership
argument.
# calculate the cosine silhouette width, which in cosine land is
# (1) the lowest average dissimilarity of the data point to any other cluster,
# minus
# (2) the average dissimilarity of the data point to all other data points in
# the same cluster
cosineSilhouette <- function(cosine.matrix, membership) {
# Args:
# cosine.matrix: the cosine similarity matrix for the words
# membership: the named membership vector for the rows and columns.
# The entries are the cluster labels (e.g. the medoid words)
# and the vector names are the words themselves.
if (!is.factor(membership)) {
stop("membership must be a factor")
}
# note that there are some floating point issues:
# (some "1" entires are actually sliiightly larger than 1)
cosine.dissim <- acos(round(cosine.matrix, 10)) / pi
widths.list <- lapply(levels(membership), function(clust) {
# filter rows of the similarity matrix to words in the current cluster
# filter cols of the similarity matrix to words in the current cluster
cosine.matrix.inside <- cosine.dissim[membership == clust,
membership == clust]
# a: average dissimilarity of i with all other data in the same cluster
a <- apply(cosine.matrix.inside, 1, mean)
# filter rows of the similarity matrix to words in the current cluster
# filter cols of the similarity matrix to words NOT in the current cluster
other.clusters <- levels(membership)[levels(membership) != clust]
cosine.matrix.outside <- sapply(other.clusters, function(other.clust) {
cosine.dissim[membership == clust, membership == other.clust] %>%
apply(1, mean) # average dissimilarity to the words in the other cluster
})
# b is the lowest average dissimilarity of i to any other cluster of
# which i is not a member
b <- apply(cosine.matrix.outside, 1, min)
# silhouette width is b - a
cosine.sil.width <- b - a
data.frame(word = names(cosine.sil.width), width = cosine.sil.width)
})
widths.list <- do.call(rbind, widths.list)
# join membership onto data.frame
membership.df <- data.frame(word = names(membership),
membership = membership)
widths.list <- left_join(widths.list, membership.df, by = "word")
return(widths.list)
}
Using the cosineSilhouette()
function to calculate the average cosine silhouette width for each \(k\), we can plot the number of clusters, \(k\), against the average cosine silhouette width across all observations.
set.seed(238942)
# calculate the average silhouette width for k=5, ..., 20
sil.width <- sapply(5:20, function(k) {
# generate k clusters
membership <- pam(cosine.similarity.full, k = k)
# calculate the silhouette width for each observation
width <- cosineSilhouette(cosine.similarity.full,
membership = factor(membership$clustering))$width
return(mean(width))
})
# plot k versus silhouette width
data.frame(k = 5:20, width = sil.width) %>%
ggplot(aes(x = k, y = width)) +
geom_line() +
geom_point() +
scale_y_continuous(name = "Avergae silhouette width")
From this plot, the average silhouette width is highest at \(k = 11\), so we take \(k = 11\) as the best candidate by this criterion.
Next, for each set of cluster membership pairs (where each membership vector is calculated based on a 90% sub-sample of the data), we want to calculate the Jaccard similarity of the membership vectors.
Since each membership iteration corresponds to a 90% sub-sample, we ignore words that are missing from either of the iterations.
The generateClusters()
function generates k
clusters (where we range k
over some set of values such as 5 to 20). We do this N
times, each time taking a subset of 90% of the data.
Based on these N
iterations of clusterings, we will evaluate both performance and stability and select a number of clusters based on these criteria.
library(cluster)
# perform clustering for k in k.range clusters over N 90% sub-samples
generateClusters <- function(similarity.mat, k.range, N) {
random.subset.list <- lapply(1:N, function(i) {
sample(1:nrow(similarity.mat), floor(0.9 * nrow(similarity.mat)))
})
lapply(k.range, function(k) {
print(paste("k =", k))
lapply(1:N, function(i) {
# randomly sample 90% of words
cosine.sample <- similarity.mat[random.subset.list[[i]], random.subset.list[[i]]]
# perform clustering
pam.clusters <- pam(1 - cosine.sample, k = k, diss = TRUE)
})
})
}
We decide to test the range \(k = 5, \dots, 20\) and repeat each of these clusterings across 100 different 90% sub-samples.
# generate clusters ranging from 5 to 20 cluster groups for each of 100 subsamples
# This will take a little while to run
cluster.iterations <- generateClusters(cosine.similarity.full,
k.range = 5:20,
N = 100)
We next need to clean the results into a nicer format. The outer list of join.cluster.iterations
below has one entry per value of k
. Each entry is a data frame in which the first column contains the words and each of the remaining 100 columns contains the cluster memberships from one 90% sub-sample.
# clean the simulation structure
join.cluster.iterations <- lapply(cluster.iterations, function(list) {
# for each list of iterations (for a specific k),
# full-join the membership vectors into a data frame
# (there will be missing values in each column)
Reduce(function(x, y) full_join(x, y, by = "words"),
lapply(list, function(cluster.obj) {
df <- data.frame(words = names(cluster.obj$clustering),
clusters = cluster.obj$clustering)
}))
})
# clean column names
join.cluster.iterations <- lapply(join.cluster.iterations, function(x) {
colnames(x) <- c("words", paste0("membership", 1:100))
return(x)
})
Below we print the first six entries of the membership vectors for 7 of the 100 iterations when \(k = 5\). Notice that there are NA
values: these correspond to words that were omitted from that 90% sub-sample. There are actually 100 membership columns in the full data frame, each corresponding to an iteration of PAM with \(k=5\) on a 90% sub-sample.
# view the first 8 columns of the first data frame (corresponding to k=5)
kable(head(join.cluster.iterations[[1]][, 1:8]))
words | membership1 | membership2 | membership3 | membership4 | membership5 | membership6 | membership7 |
---|---|---|---|---|---|---|---|
force | 1 | 1 | NA | 2 | 1 | 1 | 1 |
issues | 2 | 5 | NA | 1 | 4 | 4 | NA |
newark | 3 | 3 | 1 | 5 | 3 | 2 | 2 |
justices | 1 | 5 | 5 | 2 | 1 | 1 | 1 |
iraqi | 3 | 3 | 1 | 5 | 3 | 2 | 2 |
stock | 1 | 2 | 4 | NA | 1 | 1 | 1 |
Next, for each pair of these cluster iterations, we can calculate the Jaccard similarity. To speed things up, we wrote a Jaccard function in C++. Note also that to avoid correlations, we take independent pairs: for example, we calculate the Jaccard similarity between the membership vectors from iterations 1 and 2, then from iterations 3 and 4, then from 5 and 6, and so on. This means that for each value of \(k\) we calculate 50 Jaccard similarity values.
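The C++ source file itself isn't shown in this post. For reference, here is a rough plain-R sketch of what we assume RcppSimilarity computes: the pair-counting Jaccard index between two membership vectors, i.e. of all word pairs placed in the same cluster by at least one of the two clusterings, the proportion placed in the same cluster by both. The Rcpp version is just a much faster equivalent.
# plain-R sketch of a pair-counting Jaccard index between two clusterings
# (assumed to mirror what RcppSimilarity does, but much slower)
jaccardSimilarityR <- function(cluster1, cluster2) {
  # which pairs of words fall in the same cluster, under each clustering?
  same1 <- outer(cluster1, cluster1, "==")
  same2 <- outer(cluster2, cluster2, "==")
  # consider each pair only once, and ignore the trivial self-pairs
  keep <- upper.tri(same1)
  # pairs together in both clusterings / pairs together in at least one
  sum(same1[keep] & same2[keep]) / sum(same1[keep] | same2[keep])
}
For example, jaccardSimilarityR(c(1, 1, 2, 2), c(1, 1, 1, 2)) gives 0.25: of the four word pairs grouped together in at least one of the two clusterings, only one pair is grouped together in both.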
# calculate the pairwise jaccard similarity between each of the cluster
# memberships across the common words
# to avoid correlation, we do this pairwise between simulations 1 and 2,
# and then between simulations 3 and 4, and so on
library(Rcpp)
library(reshape2)
# use Rcpp to speed up the computation
sourceCpp('code/Rcpp_similarity.cpp')
jaccard.similarity <- sapply(join.cluster.iterations,
function(cluster.iteration) {
sapply(seq(2, ncol(cluster.iteration) - 1, by = 2),
function(i) {
# calculate the Jaccard similarity between each pair of columns
cluster.iteration.pair <- cluster.iteration[ , c(i, i + 1)]
colnames(cluster.iteration.pair) <- c("cluster1", "cluster2")
# remove words that do not appear in both 90% sub-samples
cluster.iteration.pair <- cluster.iteration.pair %>%
filter(!is.na(cluster1), !is.na(cluster2))
# Calculate the Jaccard similarity between the two cluster vectors
RcppSimilarity(cluster.iteration.pair[ , 1],
cluster.iteration.pair[ , 2])
})
})
We next want to melt jaccard.similarity
into a long-form data frame that is easy to manipulate for visualisation.
# melt the Jaccard similarity matrix into long format
jaccard.similarity.long <- melt(jaccard.similarity)
colnames(jaccard.similarity.long) <- c("iter", "k", "similarity")
# k is the number of clusters
jaccard.similarity.long$k <- jaccard.similarity.long$k + 4
jaccard.similarity.long <- jaccard.similarity.long %>%
filter(k <= 20)
# average over iterations
jaccard.similarity.avg <- jaccard.similarity.long %>%
group_by(k) %>%
summarise(similarity = mean(similarity))
Plotting the Jaccard similarity for each \(k\), we find that as \(k\) increases, the variability decreases, but so too does the similarity!
# plot number of clusters versus Jaccard similarity
ggplot(jaccard.similarity.long) +
geom_boxplot(aes(x = k, y = similarity, group = k)) +
geom_line(aes(x = k, y = similarity),
linetype = "dashed",
data = jaccard.similarity.avg) +
ggtitle("Jaccard similarity versus k")
It seems as though for values of \(k\) from 8 to 20 there is little difference in the Jaccard similarity (with an average Jaccard similarity value close to 0.5), so we maintain that \(k=11\) is a suitable choice.
Having decided that \(k=11\) is an appropriate number of clusters, we use PAM on the full cosine distance matrix (it is easier to provide a dissimilarity matrix than a similarity matrix when clustering). Later we will compare this to the clustering we obtain when we set \(k=12\), so we compute that too.
# note that there are some floating point issues in the similarity matrix:
# some "1" entires are actually sliiightly larger than 1, so we round to
# the nearest 10 dp when calcualting the distance matrix
word.clusters <- pam(acos(round(cosine.similarity.full, 10)) / pi, k = 11, diss = TRUE)
word.clusters.12 <- pam(acos(round(cosine.similarity.full, 10)) / pi, k = 12, diss = TRUE)
We then define the cluster labels to be the medoid (center) word for that cluster (recall that PAM is like k-means but requires that the center of the cluster is a data point).
# print the cluster medoids
word.clusters$medoids
## [1] "just" "struggle" "american" "north" "government"
## [6] "pushes" "give" "murder" "proposal" "bombing"
## [11] "companies"
# convert the membership vector to a factor
word.membership <- factor(word.clusters$clustering)
# print the cluster medoids
word.clusters.12$medoids
## [1] "just" "struggle" "american" "north" "government"
## [6] "pushes" "children" "give" "murder" "proposal"
## [11] "bombing" "companies"
# convert the membership vector to a factor
word.membership.12 <- factor(word.clusters.12$clustering)
# replace integer membership by medoid membership
levels(word.membership) <- word.clusters$medoids
# replace integer membership by medoid membership
levels(word.membership.12) <- word.clusters.12$medoids
Next, we compare the clusterings obtained with \(k = 11\) and \(k = 12\). In particular, we are curious about which clusters are forced to split into two or more clusters. Below we plot a superheatmap displaying, for each of the 11 clusters, the proportion of its words that fall in each of the 12 clusters.
# compare the membership vectors with 11 and 12 clusters
word.membership.split <- split(word.membership, word.membership)
word.membership.split.12 <- split(word.membership.12, word.membership.12)
compare.11.12 <- sapply(word.membership.split, function(i) {
sapply(word.membership.split.12, function(j) {
sum(names(i) %in% names(j)) / length(i)
})
})
superheat(compare.11.12,
heat.pal = c("white", "grey", "black"),
heat.pal.values = c(0, 0.1, 1),
column.title = "11 clusters",
row.title = "12 clusters",
bottom.label.text.angle = 90,
bottom.label.size = 0.4)
Next, we can calculate the cosine silhouette width for each word. We will plot this above the columns of our superheatmap.
# calculate the cosine silhouette width
cosine.silhouette <-
cosineSilhouette(cosine.similarity.full, word.membership)
# arrange the words in the same order as the original matrix
rownames(cosine.silhouette) <- cosine.silhouette$word
cosine.silhouette <- cosine.silhouette[rownames(cosine.similarity.full), ]
Next, we want to order the clusters by average silhouette width.
# calculate the average width for each cluster
avg.sil.width <- cosine.silhouette %>%
group_by(membership) %>%
summarise(avg.width = mean(width)) %>%
arrange(avg.width)
# add a blank space after each word (for aesthetic purposes)
word.membership.padded <- paste0(word.membership, " ")
# reorder levels based on increasing separation
word.membership.padded <- factor(word.membership.padded,
levels = paste0(avg.sil.width$membership, " "))
We are now ready to plot a clustered superheatmap.
superheat(cosine.similarity.full,
# row and column clustering
membership.rows = word.membership.padded,
membership.cols = word.membership.padded,
# top plot: silhouette
yt = cosine.silhouette$width,
yt.axis.name = "Cosine\nsilhouette\nwidth",
yt.plot.type = "bar",
yt.bar.col = "grey35",
# order of rows and columns within clusters
order.rows = order(cosine.silhouette$width),
order.cols = order(cosine.silhouette$width),
# bottom labels
bottom.label.col = c("grey95", "grey80"),
bottom.label.text.angle = 90,
bottom.label.text.alignment = "right",
bottom.label.size = 0.28,
# left labels
left.label.col = c("grey95", "grey80"),
left.label.text.alignment = "right",
left.label.size = 0.26,
# title
title = "(a)")
Due to the amount of information it contains, the heatmap above appears very grainy, so we decide to smooth within each cluster (replacing the cells in each cluster block with their median value).
superheat(cosine.similarity.full,
# row and column clustering
membership.rows = word.membership.padded,
membership.cols = word.membership.padded,
# top plot: silhouette
yt = cosine.silhouette$width,
yt.axis.name = "Cosine\nsilhouette\nwidth",
yt.plot.type = "bar",
yt.bar.col = "grey35",
# order of rows and columns within clusters
order.rows = order(cosine.silhouette$width),
order.cols = order(cosine.silhouette$width),
# bottom labels
bottom.label.col = c("grey95", "grey80"),
bottom.label.text.angle = 90,
bottom.label.text.alignment = "right",
bottom.label.size = 0.28,
# left labels
left.label.col = c("grey95", "grey80"),
left.label.text.alignment = "right",
left.label.size = 0.26,
# smooth heatmap within clusters
smooth.heat = T,
# title
title = "(b)")
Lastly, we produce some word clouds to identify the members of each cluster. The function below produces a word cloud for a specific word cluster.
library(RColorBrewer)
library(wordcloud)
# define a function that takes the cluster name and the membership vector
# and returns a word cloud
makeWordCloud <- function(cluster, word.membership, words.freq) {
words <- names(word.membership[word.membership == cluster])
words.freq <- words.freq[words]
# make all words black except for the cluster center
words.col <- rep("black", length = length(words.freq))
words.col[words == cluster] <- "red"
# the size of the words will be the frequency from the NY Times headlines
wordcloud(words, words.freq, colors = words.col,
ordered.colors = TRUE, random.order = FALSE, max.words = 80)
}
In each word cloud, the cluster center is in red, and the size of each word corresponds to its frequency in the NY Times headlines. It is interesting to notice that the cluster center often has a fairly low frequency.
# plot word clouds
set.seed(52545)
for (word in levels(word.membership)) {
makeWordCloud(word, word.membership, words.freq = freq)
}