In our previous classes, we relied on naturalistic annotations — star ratings at review sites and reader reactions at the Experience Project — to develop sentiment-rich lexicons. In NLP parlance, this was supervised lexicon building, in the sense that we made inferences only where we could be guided by the annotations.
There is an abundance of such data on the Web these days, at sites covering a variety of topics and social situations, so we could use those same methods to develop lots of specialized lexicons.
However, inevitably, there will be situations in which we simply don't have annotated data of the right type. Some examples:
The goal of this lecture is to provide you with some simple but powerful methods for capturing some of these phenomena where you have no sentiment labels. The approaches fall under the heading of unsupervised vector space models. Such approaches tend to be less accurate than the supervised ones we covered before, but they are considerably more flexible.
If you haven't already, get set up with the course data and code.
That's the basic recipe. The procedure differs slightly for different matrix designs.
For example, if you are building a word × document matrix, then the rows of M represent words and the columns of M represent documents. The scan in step 2 then just keeps track of (word, document) pairs — compiling the number of times that each word appears in each document. Such matrices are often used in information retrieval, because the columns are multi-set representations of documents. They are much sparser than the word × word matrices we will work with here. (In my experience, they yield lower quality lexicons, but others have reported good results with them.)
We'll return to matrix design types near the end of this lecture. Before doing that, let's get a feel for what the word × word matrices can tell us.
The code distribution for this course includes a few different word × word matrices, and code for working with them. Here's a look at a slice of the IMDB word × word matrix, which is a (5000 x 5000)-sized matrix derived from this large public release of IMDB reviews. The notion of context is co-occurrence in the same review. The vocabulary was chosen by sorting the words by overall frequency and keeping the top 5000 items.
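If you want to load the matrix directly, something like the following should work; the filename here is my guess (by analogy with the imdb-worddoc.csv file mentioned later), so check the code distribution for the actual name:

## Load the word x word count matrix; the filename is an assumption.
imdb <- as.matrix(read.csv('imdb-wordword.csv', row.names=1, check.names=FALSE))
dim(imdb)   # should be 5000 x 5000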
The diagonal cells in this matrix give each word's total token count, and the values in imdb[i, j] and imdb[j, i] are always identical. Our guiding idea will be that the rows in these matrices (equivalently, the columns, given our design) capture important aspects of the meanings of the words in our lexicon. Additionally, we will assume that similarity between rows correlates with similarity of meaning.
It's worth quickly noting that you might consider an initial smoothing step at this stage. In the simplest case, this just means adding a small uniform value to each cell. In R, you can do this to a matrix m with m = m + x, where x is the smoothing factor. Since it is easy to smooth the matrix after building it, I recommend not smoothing at the initial count stage, so that you have flexibility later.
The most basic and intuitive distance measure between vectors is Euclidean distance:
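For two vectors \(u\) and \(v\) of dimension \(n\):

\[ \mathrm{euclidean}(u, v) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2} \]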
If we stick with vectors of two elements, treating the first as the x coordinate and the second as the y coordinate, then we can plot our vectors on a two-dimensional plane. Euclidean distance then measures the length of the shortest line we can draw between the two points.
The vsm.R code that you already loaded (with source('vsm.R')) contains a function EuclideanDistance that implements this for vectors. Here is an illustration:
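Here is a minimal sketch with hypothetical values for a, b, and c, chosen to have the character described below (a and b aligned but very different in overall size; b and c similar in size but distributionally opposed):

## Hypothetical example vectors (stand-ins for the ones in the original
## illustration): a and b point in similar directions but differ greatly
## in size; b and c are similar in size but distributionally opposed.
a <- c(2, 4)
b <- c(10, 19)
c <- c(20, 10)

## Plot the three vectors as points in the plane.
plot(rbind(a, b, c), xlab='x', ylab='y', pch=19, xlim=c(0, 22), ylim=c(0, 22))
text(rbind(a, b, c), labels=c('a', 'b', 'c'), pos=3)

EuclideanDistance(a, b)   # large, despite the similar distributions
EuclideanDistance(b, c)   # smaller, despite the opposed distributions
EuclideanDistance(a, c)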
The numerical values of course jibe well with the raw visual distance in the plot. However, suppose we think of the vectors as word meanings in the vector-space sense. In that case, the values don't look good: the distributions of b and c are more or less directly opposed, suggesting very different meanings, whereas a and b are rather closely aligned, abstracting away from the fact that the first is far less frequent than the second.
In terms of the large vector space models we will soon explore, a and b resemble a pair like superb and good, which have similar meanings but very different frequencies. In contrast, b and c are like good and bad — similar overall frequencies but different distributions with respect to the overall vocabulary.
These affinities are immediately apparent if we normalize the vectors by their length:
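Length normalization maps each vector \(u\) to the unit-length vector pointing in the same direction:

\[ \|u\|_{2} = \sqrt{\sum_{i=1}^{n} u_i^{2}} \qquad \hat{u} = \frac{u}{\|u\|_{2}} \]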
The vsm.R function for this is LengthNorm:
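A quick look at the hypothetical vectors from above after normalization:

LengthNorm(a)   # approximately (0.45, 0.89)
LengthNorm(b)   # approximately (0.47, 0.89): now very close to a
LengthNorm(c)   # approximately (0.89, 0.45): nearly the mirror image of b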
Here, the connection between a and b is more apparent, as is the opposition between b and c.
Cosine distance factors out overall length: it gives the same rankings as Euclidean distance over length-normalized vectors, since both depend only on where the vectors fall on the unit sphere:
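In the same notation:

\[ \mathrm{cosine}(u, v) = 1 - \frac{\sum_{i=1}^{n} u_i v_i}{\|u\|_{2} \, \|v\|_{2}} \]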
The following plot shows the effect that cosine similarity has on our running example. The distances can be generated with CosineDistance:
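With the hypothetical vectors from above:

CosineDistance(a, b)   # tiny: a and b point in nearly the same direction
CosineDistance(b, c)   # much larger: b and c are distributionally opposed
CosineDistance(a, c)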
Cosine similarity is typically better for capturing semantic similarities, because these are usually not dependent upon overall frequency, but rather only on how the words are distributed with respect to other words.
The vsm.R code also makes available KL divergence (KLDivergence) and skew (Skew), two other metrics that also correct for overall frequency. Additional measures can easily be added to the code. Some examples: Jaccard distance, Dice distance, Manhattan distance, matching coefficient, and overlap. As long as your implementation takes in two R numeric vectors and returns a single numeric value, it should be compatible with the functions described below.
Exercise vsm:ex:dist Use KLDivergence on the vectors a, b, and c defined above. How do the results compare with EuclideanDistance and CosineDistance?
Exercise vsm:ex:skew The Skew function uses KL-divergence but with a prior value between 0 and 1 that pushes the distribution being compared to be more like the reference distribution:
\[ \mathrm{Skew}_{\alpha}(p, q) = D(p \,||\, \alpha q + (1 - \alpha)p) \]
Beginning with the count vectors p = c(1,2,7) and q = c(7,2,1), run Skew(p, q, alpha=x) for several values of x, including at least 0 and 1. Describe what happens as alpha goes from 0 to 1.
Exercise vsm:ex:others Implement Jaccard and Dice distance metrics. What is noteworthy about the values and rankings that these two deliver?
The function Neighbors in vsm.R is a highly flexible interface for seeing which associations a given distance metric finds in a matrix.
By default, if you give Neighbors just a matrix as its first argument and a word as its second argument, it will assume that the word labels a row and build a data.frame that gives the distances from that word for each word in vocabulary (each row name):
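For example, with the IMDB matrix loaded as imdb:

## Distances from 'happy' to every word in the vocabulary, nearest first.
df <- Neighbors(imdb, 'happy')
head(df)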
To find the things that are farthest from happy, use tail(df).
Here is an equivalent call with the default arguments specified:
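Something like the following; byrow is documented just below, but the name of the distance-function argument (distfunc) is my guess, so check vsm.R for the actual keyword:

## Equivalent to Neighbors(imdb, 'happy') if these are indeed the defaults;
## 'distfunc' is a guessed argument name.
df <- Neighbors(imdb, 'happy', byrow=TRUE, distfunc=CosineDistance)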
If byrow=FALSE, then Neighbors will assume you've given a column name and will thus give column neighbors. For our word × word matrix, this is equivalent (though, due to floating point imprecision, the values are sometimes slightly shuffled).
You can supply distance functions other than the default CosineDistance, either those already provided by vsm.R or ones you define on your own.
Exercise vsm:ex:neighbors Pick a few words from a single domain and see what their neighbors are like using Neighbors, comparing CosineDistance with EuclideanDistance.
Exercise vsm:ex:freq1 We saw that Euclidean distance favors raw frequencies. Find words in the matrix imdb that help make this point: a pair that are semantically unrelated but close according to EuclideanDistance, and a pair that are semantically related but far apart according to EuclideanDistance.
Exercise vsm:ex:freq2 To what extent does using CosineDistance address the problem you uncovered in exercise vsm:ex:freq1?
We've seen that, to capture semantic similarities, we need to normalize the vectors, either using LengthNorm or by turning them into probability distributions (in KL divergence and its variants). This step needs to be taken, but it is risky to take on its own, because it erases information about the amount of evidence we have for each word. For example, intuitively, we are in different evidential situations with the following vectors:
However, if we turn them into probability distributions, then the distinction is completely erased:
The same thing happens with length normalization:
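A hypothetical illustration (these vectors are mine, not the ones from the original): the second vector represents ten times as much evidence, but both remappings collapse the difference entirely.

## Same distribution, very different amounts of evidence.
u <- c(1, 10, 100)
v <- c(10, 100, 1000)

u / sum(u)      # as probabilities: identical to v / sum(v)
v / sum(v)

LengthNorm(u)   # length-normalized: also identical to LengthNorm(v)
LengthNorm(v)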
This section seeks to address this problem by re-weighting the counts in the matrix to better capture the underlying relationships, amplifying the things for which we have evidence and reducing the things that we have little evidence for.
Before moving on, it's worth noting that remapping each row/column by turning it into a vector of probabilities and length normalizing are both kinds of re-weighting. They suffer from the above drawback, which is why we need to do more, but the more complex weighting schemes all capture the normalization insight that these remappings embody.
The method I focus on is a variant of pointwise mutual information (PMI). The basic PMI calculation takes a count matrix M and reweights each cell as follows:
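Writing \(F\) for the matrix of joint probabilities obtained by dividing each cell of M by the sum of all the cells in M, the reweighted value for cell \((i, j)\) is:

\[ \mathrm{pmi}(i, j) = \log\frac{F[i, j]}{\sum_{k} F[i, k] \; \sum_{k} F[k, j]} \]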
The value sum(F[i, ]) is the row probability, and the value sum(F[ , j]) is the column probability. This is implemented in vsm.R as PMI:
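On the toy count matrix m (the name p for the result is my choice, matching the references to p just below):

p <- PMI(m)
round(p, 2)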
What have we done? Well, the first two rows are identical, a kind of normalization as we saw above. The first two values in row 3 have been greatly amplified in virtue of the fact that the rows have relatively low probability. This is arguably good; whereas the first two rows involve things that look like function words (they occur almost everywhere), row 3 contains something that might be genuinely informative. The biggest worry I see is that p[4,4] is positively enormous. Thinking row-wise, it is basically an isolate, and column-wise it is also surprising. PMI treats it as though it were extremely valuable and special. However, in truth, it is probably some kind of weird and untrustworthy oddity in our data.
A common first adjustment to the raw PMI values in vector-space models is to map all negative values to 0: Positive PMI (PPMI).
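In symbols:

\[ \mathrm{ppmi}(i, j) = \max(0, \mathrm{pmi}(i, j)) \]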
The more important step is what I'll call contextual discounting. It pushes back against the tendency of PMI to favor things with extremely low counts. It does this by penalizing items that find themselves in rows or columns that are sparse:
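One common formulation of this discount, following Pantel and Lin, multiplies each PMI value by two factors, one reflecting the size of the count itself and one reflecting the sparser of its row and column; I believe vsm.R implements this or something very close to it, but check the code:

\[ \mathrm{newpmi}(i, j) = \mathrm{pmi}(i, j) \cdot \frac{M[i, j]}{M[i, j] + 1} \cdot \frac{\min\bigl(\sum_{k} M[i, k], \sum_{k} M[k, j]\bigr)}{\min\bigl(\sum_{k} M[i, k], \sum_{k} M[k, j]\bigr) + 1} \]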
PMI implements this discounting with the keyword argument discounting=TRUE:
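On the toy matrix:

round(PMI(m, discounting=TRUE), 2)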
We have made progress in diminishing the worrisome m[4,4] cell, which had gotten so big relative to its underlying count. Let's see what things look like with our IMDB matrix:
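A hedged sketch of that step. I reuse the name imdb.pccd that appears later in these notes, and I'm assuming that the positive (PPMI) clipping is handled inside PMI rather than by a separate flag; check vsm.R to be sure:

## Reweight the full IMDB matrix and inspect some neighbors.
## NOTE: whether negative values are clipped to 0 (PPMI) by default or via
## a separate argument is an assumption here; consult vsm.R.
imdb.pccd <- PMI(imdb, discounting=TRUE)
head(Neighbors(imdb.pccd, 'happy'))
tail(Neighbors(imdb.pccd, 'happy'))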
Exercise vsm:ex:pmi Play around with PMI and its variants, with different distance measures, using the small matrix m defined above.
Exercise vsm:ex:tfidf1 TF-IDF (term frequency-inverse document frequency) is a widely used weighting scheme in information retrieval:
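One standard formulation, for row \(i\) and column \(j\), with \(D\) the number of columns and \(\mathrm{df}(i)\) the number of columns in which row \(i\) is nonzero (the TFIDF function in vsm.R may differ in details such as the logarithm base or term-frequency scaling):

\[ \mathrm{tfidf}(i, j) = M[i, j] \cdot \log\frac{D}{\mathrm{df}(i)} \]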
TF-IDF is implemented in vsm.R as TFIDF. Run TFIDF on the toy matrix m that we defined above, and describe the results. How do they compare with those of PMI and its variants?
Exercise vsm:ex:tfidf2 Reweight the imdb matrix using TFIDF, and then explore the results using the Neighbors function. To my eye, the results are not as good as those of PPMI with contextual discounting. What features of the matrix are likely culprits? Are there matrix types where you would expect TF-IDF to do better?
You can begin to get a feel for what your matrix is like by poking around with the Neighbors function to see who is close to or far from whom. But this kind of sampling is unlikely to lead to new insights, unless you luck out and start to see an interesting cluster of associations developing.
The t-SNE visualization technique is a large-scale, intuitive way of identifying associations, useful for guiding later and more precise investigations. The method takes the matrix, does some dimensionality reduction on it, and then finds a clever way to display the resulting distances in two dimensions.
The vsm.R function for visualizing a matrix this way is Matrix2Tsne. The underlying R implementation is slow, so it's a good idea to check the code with a toy example before waiting a long time only to encounter a bug:
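For example, you might first run it on a small slice of the matrix (I'm assuming here that Matrix2Tsne needs only the matrix as its required argument; check vsm.R for the options it accepts):

## Sanity check on a small slice before committing to the full run.
Matrix2Tsne(imdb[1:100, 1:100])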
I've found that the best results come from first reweighting the matrix using Positive PMI with contextual discounting and then visualizing:
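A sketch of the full pipeline, reusing imdb.pccd from above:

## Positive PMI with contextual discounting, then t-SNE. This is the slow step.
Matrix2Tsne(imdb.pccd)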
The R implementation is slow — the above takes more than an hour on my laptop, but the results are worth it! In interpreting the results, you should regard tight clusters of words as interestingly related. Conversely, large evenly spaced groups of words are those for which t-SNE could not find associations. At the macro-level, distance between clusters seems not to be meaningful, and overall position in the diagram (top, left, etc.) is arbitrary — chosen by t-SNE to deliver a perspicuous representation, not to capture anything about the data directly.
Here's an example involving adjective-adverb pairs derived from the NYT section of the English Gigaword. The matrix design is adjective × adverb. The pairs come from advmod dependencies as defined by the Stanford Dependency Parser. The first plot clusters adjectives, the second adverbs.
The t-SNE visualization of our running example suggests some lexical clusters that in turn suggest the beginnings of a lexicon. The semantic orientation method of Turney and Littman is a general method for building such lexicons for any desired semantic dimension.
The method simply exploits our above insights about vector similarity. Here are the steps:
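In outline: pick two small seed sets that represent the opposing poles of the dimension you care about; for each target word, sum its distances to the members of each seed set; the difference between the two sums then places the word along the dimension. The following is my own minimal sketch of that calculation, not the vsm.R implementation, and the sign convention in SemanticOrientation may differ:

## Sketch of the semantic orientation score (illustration only; see
## SemanticOrientation in vsm.R for the real implementation).
SemanticOrientationSketch <- function(mat, word, seeds1, seeds2,
                                      distance=CosineDistance) {
  d1 <- sum(sapply(seeds1, function(s) distance(mat[word, ], mat[s, ])))
  d2 <- sum(sapply(seeds2, function(s) distance(mat[word, ], mat[s, ])))
  d1 - d2  # low values: nearer to seeds1; high values: nearer to seeds2
}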
This method is implemented in vsm.R as SemanticOrientation. Here is a typical function call:
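Something along these lines; the seed sets are hypothetical, and any argument names beyond seeds1 and seeds2 are my guesses:

## Hypothetical seed sets for a positive/negative opposition.
pos <- c('good', 'great', 'excellent', 'fun')
neg <- c('bad', 'terrible', 'awful', 'dull')

SemanticOrientation(imdb.pccd, 'great', seeds1=pos, seeds2=neg)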
Here, I am using the weighted matrix we defined above using Positive PMI with contextual discounting. This reliably gives excellent results with the semantic orientation method, in my experience.
The positive score here means that the sum for neg was larger than for pos, that is, that 'great' had a larger overall distance from neg than from pos. That seems good. Similarly:
The SemanticOrientation function allows you to leave off the word argument. If you do this, it churns through the whole vocabulary, scoring it against the supplied seed sets. The result is an ordering of the vocabulary, with the top (lowest scoring) elements being nearest to seeds1 and the bottom (highest scoring) elements associating with seeds2:
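For example, with the hypothetical seed sets from above:

sr <- SemanticOrientation(imdb.pccd, seeds1=pos, seeds2=neg)
head(sr)   # vocabulary items scored as nearest to seeds1
tail(sr)   # vocabulary items scored as nearest to seeds2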
Exercise vsm:ex:noweight Do the above with the unweighted matrix. What do the results look like? How does the lack of discounting contribute?
Exercise vsm:ex:newopp Pick a new semantic opposition and try it out.
Exercise vsm:ex:nway Could the semantic orientation method be extended to n-way oppositions?
Exercise vsm:ex:km Kennedy and McNally clusters?
The above methods deliver usable lexicons for a wide variety of seed sets. However, they are not capable of capturing higher-order associations in the data. For example, both gnarly and wicked are used as slangily positive adjectives. We thus expect them to have many of the same neighbors. However, at least stereotypically, gnarly is Californian and wicked is Bostonian. Thus, they are unlikely to occur often in the same texts. Dimensionality reduction techniques are often capable of capturing their semantic similarity (and have the added advantage of shrinking the size of our data structures).
The following matrix implements my gnarly/wicked example in a word × document matrix:
The two words of interest never co-occur, but they co-occur with roughly the same set of other items. As expected, our usual favored scheme cannot capture this:
PPMI with discounting is no help:
Latent Semantic Analysis (LSA) to the rescue. LSA is based in singular value decomposition (SVD), a technique for decomposing a single matrix into three matrices: one for the rows, one for the columns, and one representing the singular values. The rows and columns of the new matrices are orthonormal, which means roughly that all the correlations they contain have been factored out. In addition, the singular values are organized from most to least important. If we simply multiply all three matrices together again, we get back our original matrix. However, if we multiply the first \(k\) columns of the row matrix by the first \(k\) singular values, then we obtain a reduced-dimensional version of our row matrix, one in which all correlations have been removed.
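In symbols, SVD factors the matrix as

\[ M = U \Sigma V^{\top} \]

where the columns of \(U\) (for the rows of \(M\)) and of \(V\) (for the columns of \(M\)) are orthonormal, and \(\Sigma\) is a diagonal matrix of singular values in decreasing order. The reduced representation keeps only the first \(k\) dimensions:

\[ M_k = U_k \Sigma_k \]

with \(U_k\) the first \(k\) columns of \(U\) and \(\Sigma_k\) the top-left \(k \times k\) block of \(\Sigma\).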
LSA is implemented in vsm.R as LSA. The only preliminary step is to run svd. (This can take a long time, so it is best to do it once and then fiddle around with it.)
We can then call LSA with the svd output as the first argument and optional additional arguments. For a large matrix like imdb, k=100 is a good starting point. Here, we use k=2 on our small example:
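A hedged sketch of the call sequence, where wd is my stand-in name for the small word × document matrix above and trunc matches the name used below:

## 'wd' is a stand-in name for the small word x document matrix above.
wd.svd <- svd(wd)            # base R singular value decomposition; run once
trunc  <- LSA(wd.svd, k=2)   # keep only the first 2 dimensions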
In trunc, gnarly and wicked are close neighbors:
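For example, using Neighbors on the reduced matrix:

Neighbors(trunc, 'gnarly')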
LSA is a classic, and relatively speedy, dimensionality reduction technique. Newer, more probabilistic alternatives include Probabilistic LSA (PLSA; Hofmann 1999) and Latent Dirichlet Allocation (LDA; Blei et al. 2003; Steyvers and Griffiths 2006). In addition, the t-SNE procedure used above includes an LSA-like dimensionality reduction step (PCA). For even more, see Turney and Pantel 2010:160.
Exercise vsm:ex:lsa Run LSA on the imdb and imdb.pccd matrices and then check a few words you care about — ideally, some that mix different semantic dimensions and different overall frequencies. How do the results compare with what you saw for these matrices before reduction?
Exercise vsm:ex:k What happens if you set k=1 using LSA? What do the results look like then? What do you think this first (and now only) dimension is capturing?
I focused on word × word matrices above, and we saw a few word × document matrices in illustrative examples. I also showed some pictures of an adjective × adverb matrix. This is just a glimpse of the infinitude of possible designs. Some others, including one that goes beyond two dimensions:
In any vector-space project, the matrix design will be the most important step you take. No amount of reweighting and dimensionality reduction will transcend a bad choice at this stage, and smart choices will amplify the utility of those steps.
Exercise vsm:ex:additions Propose three additions to the above list of matrix designs, and say briefly what research questions those designs would engage (preferably, but not necessarily, in the area of sentiment).
Exercise vsm:ex:worddoc The code distribution for today also contains a word × document matrix derived from the same data as imdb used above: imdb-worddoc.csv. Using the above techniques and measures, try to get a feel for how this matrix behaves — how it differs from the word × word version and what that means for analysis.
Exercise vsm:ex:advmod The code distribution for today also contains the adjective × adverb matrix derived from Gigaword, as used in figure vsm:tsne-advadj: gigawordnyt-advmod-matrix.csv. Using the above techniques and measures, try to get a feel for what can be done with this matrix.
Exercise vsm:ex:scaletypes A bit of background: Syrett and Lidz (2010; 30-Month-olds use the distribution and meaning of adverbs to interpret novel adjectives. Language Learning and Development 6(4): 258-282) report on a corpus study of adverb–adjective combinations in English, seeking to find patterns that can be traced to interpretive restrictions and preferences. For example, half full sounds normal and is easy to interpret, whereas half tall is odd and hard to interpret. This pattern presumably traces to the sort of scales along which we measure full and tall. (You might mutter to yourself different combinations of completely, somewhat, and mostly with full, smart, and damp to get a feel for what the patterns are like.) Use the gigawordnyt-advmod-matrix.csv matrix to explore these claims.