Sentiment and semantic composition

  1. Overview
  2. Data and code
  3. Subframes
  4. Bigram plots
  5. Comparison plots
  6. Expected Category adjustments
  7. Symmetric compositional hypotheses
  8. An additive, asymmetric compositional hypothesis
  9. Assessment

Overview

We've so far studied just word meanings. We have tried to embrace the fact that word senses are constantly being pushed around by the morphosyntactic and discourse context in which they occur, but we were still, in the end, just creating lexicons. A viable theory of meaning has to come to grips with semantic composition — how word meanings combine to form more complex meanings. The goal of this lecture is to take some tentative steps in that direction. This should also provide some connections with Noah Goodman's course and Shalom Lappin's lecture today (this place, next period!).

If you haven't already, get set up with the course data and code.

Semantic composition is an extremely active area of research in natural language processing; having gotten good at building rich lexicons, the NLPers have now turned their attention to the core problem of formal semantics, and they are making rapid progress, with new papers appearing all the time. Here's a partial list of people who are pursuing this issue, from a variety of perspectives (please let me know whom I've forgotten!):

In addition, a number of computational workshops and conferences have come together this year to form *Sem, which will probably evolve into a primary outlet for work on this topic.

It's too much to tackle the full problem of compositionality. Thus, I'm going to focus on a particular kind of compositional interaction: adverb–adjective combinations. In our ratings data, we were starting to see that modifiers have complex but systematic effects on the sentiment profiles of adjectives, so this seems like a natural place to start. It's also a vital part of good sentiment analysis, since one's text-level predictions about sentiment can be greatly improved simply by being sensitive to negation, attenuation, and emphasis as they relate to sentiment adjectives.

Here is my framing of the problem: suppose we have a bigram w1 w2 (like absolutely amazing) that has vector representation V. Suppose also that we have separate vector representations for w1 and w2, call them v1 and v2. How can we use v1 and v2 to construct V? This is just the question of semantic compositionality (how does the meaning of the parts determine the meaning of the whole?), but phrased in terms of vector-space representations of meaning.

My experiments are directly inspired by those in this new paper by Richard Socher (who pursues more ambitious models than we will).

In principle, we can go after this question using any vector representation. In the last class, our vectors were derived from co-occurrence patterns. In the classes before that, our vectors were probability values, one for each category (star rating or EP reaction distribution). Here, we'll use just the star-rating vectors obtained from IMDB data, because it's possible to grasp intuitively what those vectors are like, whereas the 1000-plus dimensions of the unsupervised approaches are mind-boggling.

Data and code

The data file for this unit:

  1. ## Load and inspect the data.frame:
  2. bi = read.csv('imdb-bigrams.csv')
  3. head(bi)
  4. Word1 Tag1 Word2 Tag2 Category Count Total
  5. 1 aa n meeting n 1 6 25395214
  6. 2 aa n meeting n 2 0 11755132
  7. 3 aa n meeting n 3 4 13995838
  8. 4 aa n meeting n 4 2 14963866
  9. 5 aa n meeting n 5 0 20390515
  10. 6 aa n meeting n 6 9 27420036

The column values:

  1. Word1: the first word in the bigram
  2. Tag1: the tag for the first word, reduced to the WordNet tags: a (adjective), n (noun), r (adverb), or v (verb)
  3. Word2: the second word in the bigram
  4. Tag2: the tag for the second word, reduced to the WordNet tags: a (adjective), n (noun), r (adverb), or v (verb)
  5. Category: the raw star rating (1-10)
  6. Count: the number of tokens for this bigram in this Category
  7. Total: the total number of bigrams in Category
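To make the column semantics concrete, here is a small Python sketch of how the derived quantities used later (Freq and Pr) come out of Count and Total. The course code is R; this is only illustrative, and the toy numbers are invented:

```python
# Toy rows shaped like imdb-bigrams.csv: one row per (bigram, Category).
# Counts and totals are invented; the real file has ten Category rows
# per bigram and corpus-wide Totals.
rows = [
    {"Word1": "very", "Word2": "best", "Category": 1, "Count": 22, "Total": 1000},
    {"Word1": "very", "Word2": "best", "Category": 2, "Count": 10, "Total": 500},
]

# Freq: the bigram's relative frequency within each Category.
for r in rows:
    r["Freq"] = r["Count"] / r["Total"]

# Pr: the per-category relative frequencies renormalized into a
# distribution over categories (the sentiment vector used throughout).
freqs = [r["Freq"] for r in rows]
total = sum(freqs)
pr = [f / total for f in freqs]
```

Dividing by Total first corrects for the heavy skew in category sizes before the per-bigram distribution is formed.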

This is a big file:

  1. ## Word types in position 1:
  2. nlevels(bi$Word1)
  3. [1] 22220
  4. ## Word types in position 2:
  5. nlevels(bi$Word2)
  6. [1] 24966
  7. ## How is Tag1 distributed?
  8. t1 = xtabs(~ Tag1, data=bi)
  9. t1/sum(t1)
  10. Tag1
  11. a n r v
  12. 0.2397737 0.4737258 0.1264735 0.1600270
  13. ## How is Tag2 distributed?
  14. t2 = xtabs(~ Tag2, data=bi)
  15. t2/sum(t2)
  16. Tag2
  17. a n r v
  18. 0.11009580 0.56157666 0.07749335 0.25083418
  19. ## How about the way the tags combine?
  20. combo = xtabs(~ Tag1 + Tag2, data=bi)
  21. combo/sum(combo)
  22. Tag2
  23. Tag1 a n r v
  24. a 0.023299277 0.201552183 0.005089869 0.009832419
  25. n 0.011006686 0.275679855 0.030212567 0.156826667
  26. r 0.047280786 0.009910979 0.014140821 0.055140933
  27. v 0.028509053 0.074433644 0.028050097 0.029034165
  28. ## Our usual heavy skew towards the positive:
  29. barplot(bi$Total[1:10], names.arg=bi$Category[1:10], xlab='Category', ylab='Words')
Figure fig:totals
Category sizes, by Count for bigrams.
figures/composition/totals.png

As before, I've written some code to facilitate interacting with the data:

  1. source('composition.R')

The basic structure of the code is similar to previous lectures: functions for extracting subframes of the data and functions for visualizing the results. What's really new is the set of functions for making predictions about compositionality.

Subframes

The subframe extraction function is bigramCollapsedFrame. It allows you to extract bigrams, or even unigrams, based on their column values. Its only required argument is a data.frame like bi above, though calling it with only this argument might cause some kind of explosion (not literally), since it will try to collapse the entire file.

Here's an example involving a fully specified bigram:

  1. bigramCollapsedFrame(bi, word1='very', tag1='r', word2='best', tag2='a')
  2. Phrase Category Count Total Freq Pr
  3. 1 very/r best/a -4.5 22 25395214 8.663050e-07 0.04451630
  4. 2 very/r best/a -3.5 10 11755132 8.506923e-07 0.04371402
  5. 3 very/r best/a -2.5 11 13995838 7.859479e-07 0.04038705
  6. 4 very/r best/a -1.5 11 14963866 7.351042e-07 0.03777437
  7. 5 very/r best/a -0.5 19 20390515 9.318058e-07 0.04788216
  8. 6 very/r best/a 0.5 44 27420036 1.604666e-06 0.08245803
  9. 7 very/r best/a 1.5 87 40192077 2.164606e-06 0.11123132
  10. 8 very/r best/a 2.5 134 48723444 2.750216e-06 0.14132374
  11. 9 very/r best/a 3.5 167 40277743 4.146210e-06 0.21305888
  12. 10 very/r best/a 4.5 342 73948447 4.624844e-06 0.23765412

The Category values are centered at 0 to create a more intuitive scale, with negativity corresponding to negative numbers, as in our first lecture.
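The centering is just a shift by the scale midpoint; a quick Python sketch of the transformation (R is the course language, so this is only illustrative):

```python
# Center the raw star ratings (1-10) at 0 by subtracting the scale
# midpoint 5.5, so that negative reviews get negative category values
# (-4.5 .. 4.5), matching the Category column in the output above.
def center(category, midpoint=5.5):
    return category - midpoint

centered = [center(c) for c in range(1, 11)]
```

The shift changes nothing substantive; it just makes the sign of an Expected Category value directly interpretable as polarity.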

You can leave off one or both of the tag arguments:

  1. bigramCollapsedFrame(bi, word1='very', tag1='r', word2='best')
  2. Phrase Category Count Total Freq Pr
  3. 1 very/r best -4.5 97 25395214 3.819617e-06 0.03780675
  4. 2 very/r best -3.5 48 11755132 4.083323e-06 0.04041692
  5. 3 very/r best -2.5 65 13995838 4.644238e-06 0.04596888
  6. 4 very/r best -1.5 69 14963866 4.611108e-06 0.04564096
  7. 5 very/r best -0.5 112 20390515 5.492750e-06 0.05436749
  8. 6 very/r best 0.5 190 27420036 6.929240e-06 0.06858593
  9. 7 very/r best 1.5 414 40192077 1.030054e-05 0.10195519
  10. 8 very/r best 2.5 662 48723444 1.358689e-05 0.13448364
  11. 9 very/r best 3.5 822 40277743 2.040829e-05 0.20200222
  12. 10 very/r best 4.5 2008 73948447 2.715405e-05 0.26877204

You can even leave off one or both of the word arguments:

  1. bigramCollapsedFrame(bi, word1='very', tag1='r', tag2='a')
  2. Phrase Category Count Total Freq Pr
  3. 1 very/r a -4.5 28379 25395214 0.001117494 0.06905723
  4. 2 very/r a -3.5 15310 11755132 0.001302410 0.08048438
  5. 3 very/r a -2.5 20293 13995838 0.001449931 0.08960068
  6. 4 very/r a -1.5 23060 14963866 0.001541046 0.09523124
  7. 5 very/r a -0.5 31801 20390515 0.001559598 0.09637769
  8. 6 very/r a 0.5 45374 27420036 0.001654775 0.10225934
  9. 7 very/r a 1.5 75510 40192077 0.001878728 0.11609886
  10. 8 very/r a 2.5 98178 48723444 0.002015005 0.12452029
  11. 9 very/r a 3.5 77862 40277743 0.001933127 0.11946051
  12. 10 very/r a 4.5 127933 73948447 0.001730030 0.10690979

You can look at a specific value for just Word1 or Word2, with or without their tag specifications:

  1. bigramCollapsedFrame(bi, word2='amazing', tag2='a')
  2. Phrase Category Count Total Freq Pr
  3. 1 amazing/a 1 1158 25395214 4.559914e-05 0.04336273
  4. 2 amazing/a 2 468 11755132 3.981240e-05 0.03785979
  5. 3 amazing/a 3 734 13995838 5.244416e-05 0.04987203
  6. 4 amazing/a 4 744 14963866 4.971977e-05 0.04728126
  7. 5 amazing/a 5 1128 20390515 5.531984e-05 0.05260667
  8. 6 amazing/a 6 1915 27420036 6.983944e-05 0.06641415
  9. 7 amazing/a 7 3553 40192077 8.840051e-05 0.08406489
  10. 8 amazing/a 8 6479 48723444 1.329750e-04 0.12645322
  11. 9 amazing/a 9 7994 40277743 1.984719e-04 0.18873781
  12. 10 amazing/a 10 23589 73948447 3.189925e-04 0.30334746

A word of caution: it often matters whether you use word1 or word2 to specify the word you want to look at (and similarly for tag1 and tag2). This is because we are looking at a selected subset of the bigrams data, and thus the unigram distributions derived from the two slots can differ.

Bigram plots

The function bigramPlot is just like our other plotting functions for supervised data. Here is a call with all of the optional arguments given at their default values, except word1, which I specified:

  1. bigramPlot(bi, word1='amazing', word2=NULL, tag1=NULL, tag2=NULL, ylim=NULL, probs=FALSE, col=1, add=FALSE)
Figure fig:amazing
Plot of amazing as the first word in the bigram, no tag specified.
figures/composition/amazing.png

The word and tag arguments behave like the corresponding arguments for bigramCollapsedFrame.

You can use ylim to adjust the y-axis, which is important if you need a uniform y-axis in order to do comparisons.

The col argument gives the color of the plot lines. 1 is the same as "black". (Colors can be specified with integers or with strings like "black", "blue", and so forth.)

You can mostly ignore the add argument. If set to TRUE, it will try to add your plot line to an existing plot. This is used by the function bigramCompositionPlot, which is described in the next section.

Exercise ex:plot Play around with bigramPlot. In anticipation of the work we want to do with this data, you might focus on some particular interactions — for example, can you start to discern how a given modifier interacts with the things it modifies?

Comparison plots

The function bigramCompositionPlot allows you to compare a bigram with its constituent parts. Here is a typical call:

  1. bigramCompositionPlot(bi, 'really','good')
Figure fig:reallygood
The bigram really good compared with its constituents.
figures/composition/really-good.png

You can optionally specify tag1 and tag2 to further restrict your gaze. ylim allows you to specify the y-axis values.

To add expected category values, use ec=TRUE.

Exercise ex:ec Continue the investigations you began above, now using bigramCompositionPlot to get an even sharper perspective on what modifiers do to their arguments. At this point, you might pick a single modifier and sample argument (word2) values to see whether patterns emerge. The ec=TRUE flag might start to suggest a coarse-grained semantics.

Expected Category adjustments

Semantic theory strongly suggests that, in adverb–adjective constructions, the adverb will take the adjective as its argument and do something with it to create a new adjectival meaning. In present terms, this comes down to seeing how adverbs modulate the distributional profiles of the things they modify.

One simple, high-level method for doing this is to look exhaustively at all of the modification data we have for a particular pattern. Expected Category (EC) values provide a rough first summary: we might expect some adverbs to push these values out towards the edges, with others pushing them to the middle.

Thus, for any adverb Adv and adjective Adj, we can look at the difference between the EC for the bigram Adv Adj and the EC for Adj alone. Perhaps Adv is a function that does something systematic to EC values.
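The EC itself is just the probability-weighted mean of the (centered) category values. A Python sketch with invented distributions (the course code is in R; the centered scale -4.5..4.5 follows the earlier discussion):

```python
# Expected Category (EC): the probability-weighted mean of the centered
# category values. The distributions below are invented for illustration.
def expected_category(pr, categories=None):
    if categories is None:
        categories = [c - 5.5 for c in range(1, 11)]  # -4.5 .. 4.5
    return sum(c * p for c, p in zip(categories, pr))

# A positively skewed distribution gets a positive EC, and its mirror
# image gets the corresponding negative EC:
pr_pos = [0.02, 0.02, 0.03, 0.03, 0.05, 0.07, 0.10, 0.15, 0.23, 0.30]
pr_neg = list(reversed(pr_pos))

ec_pos = expected_category(pr_pos)
ec_neg = expected_category(pr_neg)

# An adverb's EC adjustment for a given adjective is then a difference
# between two such point estimates:
adjustment = ec_pos - ec_neg
```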

Compiling all of these differences takes a lot of computing time and some extra trickery with R tables. Thus, I've pre-compiled it into a CSV file — just for the adverbs and adjectives, because my laptop was struggling under the weight of the full vocabulary for the bigrams data:

  1. aa = read.csv('advadj-ecs-probs.csv')
  2. head(aa)
  3. Word1 Tag1 Word2 Tag2 BigramEC B1 B2 B3
  4. 1 about r bad a -0.08451206 0.34973878 0.00000000 0.00000000
  5. 2 about r big a -0.25807297 0.24525559 0.07569117 0.00000000
  6. 3 about r different a 0.25645245 0.11588358 0.16093879 0.09011521
  7. 4 about r enough a -1.14190434 0.06129747 0.19863609 0.11122323
  8. 5 about r good a -0.22050796 0.12391260 0.13384737 0.08647590
  9. 6 about r great a 0.57533210 0.05244627 0.00000000 0.19032577
  10. B4 B5 B6 B7 B8 B9
  11. 1 0.00000000 0.00000000 0.10797082 0.22098114 0.06076261 0.22051114
  12. 2 0.05946055 0.08727192 0.12979702 0.13282614 0.07304572 0.08836242
  13. 3 0.02809519 0.05154512 0.04599694 0.09937082 0.11648541 0.14091083
  14. 4 0.15604213 0.22902742 0.02838549 0.11619173 0.07987235 0.01932410
  15. 5 0.09705802 0.10684109 0.09269264 0.06624853 0.07948881 0.08413698
  16. 6 0.26702009 0.00000000 0.00000000 0.09941394 0.02733559 0.16533750
  17. B10 ArgumentEC A1 A2 A3 A4
  18. 1 0.04003551 -1.7473124 0.23192727 0.17646539 0.14067880 0.11415540
  19. 2 0.10828946 -0.1378916 0.09608838 0.10263114 0.10698114 0.10981386
  20. 3 0.15065812 0.7925735 0.06265967 0.06635101 0.07147188 0.08086372
  21. 4 0.00000000 -0.4101139 0.09909160 0.10491410 0.11348335 0.12606232
  22. 5 0.12929807 -0.0333225 0.08795664 0.09388235 0.09935886 0.10677708
  23. 6 0.19812082 1.1029807 0.05216052 0.05749075 0.06556529 0.07425254
  24. A5 A6 A7 A8 A9 A10
  25. 1 0.09778160 0.07773984 0.05537165 0.03957109 0.03343531 0.03287364
  26. 2 0.10520268 0.10940290 0.10078790 0.09548835 0.08733057 0.08627308
  27. 3 0.08848067 0.10295605 0.12084785 0.13491833 0.14058823 0.13086260
  28. 4 0.12717739 0.12065487 0.09754466 0.07370361 0.06967177 0.06769633
  29. 5 0.11172191 0.11513720 0.11482930 0.10185136 0.08713044 0.08135486
  30. 6 0.08586029 0.09679526 0.11185751 0.13135473 0.14841190 0.17625122

As you can see, this contains the distributions for both the bigram and for Word2. We will use these later. For now, we can focus on the columns relevant for ECs:

  1. head(subset(aa, select=c(Word1, Tag1, Word2, Tag2, BigramEC, ArgumentEC)))
  2. Word1 Tag1 Word2 Tag2 BigramEC ArgumentEC
  3. 1 about r bad a -0.08451206 -1.7473124
  4. 2 about r big a -0.25807297 -0.1378916
  5. 3 about r different a 0.25645245 0.7925735
  6. 4 about r enough a -1.14190434 -0.4101139
  7. 5 about r good a -0.22050796 -0.0333225
  8. 6 about r great a 0.57533210 1.1029807

The function ecAdjustmentPlot plots, for any given adverb (value of Word1), the distribution of ArgumentEC - BigramEC.

  1. ecAdjustmentPlot(aa, 'really')
Figure fig:really
The distribution of EC adjustments imposed by really. The green line just marks 0, to help orient you.
figures/composition/really.png
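What ecAdjustmentPlot summarizes can be sketched directly: filter the rows for one adverb and take ArgumentEC - BigramEC per adjective. A Python sketch with invented rows shaped like the aa frame (the course function itself is R):

```python
# Rows mimic advadj-ecs-probs.csv; the numbers are invented.
aa_rows = [
    {"Word1": "really", "Word2": "good", "BigramEC": 1.20,  "ArgumentEC": 0.90},
    {"Word1": "really", "Word2": "bad",  "BigramEC": -2.10, "ArgumentEC": -1.70},
    {"Word1": "about",  "Word2": "good", "BigramEC": -0.22, "ArgumentEC": -0.03},
]

# For a fixed adverb, collect the per-adjective EC adjustments:
adverb = "really"
adjustments = [r["ArgumentEC"] - r["BigramEC"]
               for r in aa_rows if r["Word1"] == adverb]

# A summary statistic over that distribution:
mean_adjustment = sum(adjustments) / len(adjustments)
```

The plot is essentially a view of this distribution of adjustments, one point per adjective the adverb combines with.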

Exercise ex:ecadj Continue your adverbial investigations (begun in exercise ex:plot and exercise ex:ec), but now using ecAdjustmentPlot. Try to use the distributions you see to formulate generalizations about how specific classes of adverbs work.

Symmetric compositional hypotheses

As before, EC values are useful but too limited (and sometimes too untrustworthy) to carry the day. Since they are point estimates, they ignore a lot of the information we have in the sentiment distributions.

What we really want is to define adverbs as functions that take in adjectival sentiment distributions and morph them somehow. A good theory will be one that morphs them into something resembling the distributions we observe for the corresponding bigram. In terms of these sentiment distributions, that just is semantic composition.

The language of probability suggests two simple hypotheses right off the bat:

  1. We can multiply the adverb vector and the adjectival vector, point-wise, and then renormalize those values to create a new probability distribution. This corresponds to the semantic claim that composition is intersective.
  2. We can add the adverb vector and the adjectival vector, point-wise, and then renormalize the values. This corresponds to the claim that semantic composition is disjunctive.
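Both hypotheses are a pointwise combination followed by renormalization. Here is a Python sketch with toy 4-category vectors (the course functions Intersective and Disjunctive are R implementations; these are only illustrative):

```python
def renormalize(v):
    # Rescale a nonnegative vector so it sums to 1.
    s = sum(v)
    return [x / s for x in v]

def intersective(v1, v2):
    # Pointwise product, renormalized: the "intersective" claim.
    return renormalize([a * b for a, b in zip(v1, v2)])

def disjunctive(v1, v2):
    # Pointwise sum, renormalized: the "disjunctive" claim.
    return renormalize([a + b for a, b in zip(v1, v2)])

adv = [0.10, 0.20, 0.30, 0.40]  # toy adverb sentiment vector
adj = [0.40, 0.30, 0.20, 0.10]  # toy adjective sentiment vector

pred_int = intersective(adv, adj)
pred_disj = disjunctive(adv, adj)
```

Note how the product sharpens agreement between the two vectors, while the sum washes their differences out; that contrast is worth keeping in mind for the assessment below.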

I've called this section "Symmetric compositional hypotheses" because both of these analyses assume that the adverb and the adjective are equal partners, with neither truly a functor on the other.

Both of these analyses will seem wrong to the semanticist in you, but bear with me! It is still instructive to see how well they work.

Returning to bigramCompositionPlot, we can plot these predictions alongside the empirical estimates by filling in values for the optional prediction.func argument. The function Intersective implements the multiplicative hypothesis, and the function Disjunctive implements the additive hypothesis:

  1. bigramCompositionPlot(bi, 'really','good', prediction.func=Intersective)
  2. bigramCompositionPlot(bi, 'really','good', prediction.func=Disjunctive)
Figure fig:reallygood-intpred
The bigram really good compared with its constituents, along with intersective predictions.
figures/composition/really-good-int.png
Figure fig:reallygood-disjpred
The bigram really good compared with its constituents, along with disjunctive predictions.
figures/composition/really-good-union.png

Exercise ex:sym Explore these compositional hypotheses. What kinds of values do they tend to give? Are there adverbs for which their assumptions seem approximately true, or very clearly false? Why?

An additive, asymmetric compositional hypothesis

Our guiding insight from compositional analyses is that the adverb will be a functor on the adjective, taking in its meaning and transforming it in some way to produce a meaning for the whole. The simple analyses above don't make good on this insight.

As a first pass, suppose you want to analyze a particular adverb Adv. Consider all of the adverb–adjective bigrams it participates in. In each case, it warps the adjective to create a new vector. That is, in each case, for each star rating, it performs an adjustment, moving the probability up or down:

  1. bigramCompositionPlot(bi, 'really','good')
  2. ## Add arrows showing the adjustment:
  3. rg = bigramCollapsedFrame(bi, 'really', 'good')
  4. g = bigramCollapsedFrame(bi, word2='good')
  5. arrows(g$Category, g$Pr, rg$Category, rg$Pr, length=0.1, col='green')
  6. ## Vector of differences:
  7. a = rg$Pr - g$Pr
  8. a
  9. [1] -0.028067482 -0.027401446 -0.023103547 -0.022600395 -0.012180962
  10. [6] -0.007382117 0.002950026 0.034558100 0.036675384 0.046552439
  11. ## Perfectly reconstruct the bigram probabilities by adding:
  12. a + g$Pr
  13. [1] 0.05990439 0.06645296 0.07624326 0.08411042 0.09943191 0.10766863 0.11773779 0.13654382 0.12394086 0.12796597
  14. rg$Pr
  15. [1] 0.05990439 0.06645296 0.07624326 0.08411042 0.09943191 0.10766863 0.11773779 0.13654382 0.12394086 0.12796597
Figure fig:really-adjust
The "adjustments" really imposes on good.
figures/composition/adjustments.png

Hypothesis: an adverb is a function that takes an adjective meaning (qua probability vector) V and adjusts each Vi by the mean difference it imposes for category i, with the mean taken over all the adjectives in our data.

The mean differences are the maximum likelihood estimates, so this is a natural hypothesis to start with, assuming that the data are underlyingly linear in a way that makes differences appropriate.

The file that we loaded above as aa

  1. aa = read.csv('advadj-ecs-probs.csv')

contains the values we need to calculate the mean difference vector for each adverb. Here is an example of how to do that:

  1. mod = subset(aa, Word1=='really')
  2. ## Bigram vectors:
  3. bipr = mod[, paste('B', seq(1,10), sep='')]
  4. ## Adjective vectors:
  5. adjpr = mod[, paste('A', seq(1,10), sep='')]
  6. ## Differences
  7. diffs = bipr - adjpr
  8. ## Averages; the function apply takes in the data.frame as first argument
  9. ## The 2 says to calculate via columns (1 = rows)
  10. ## The third argument is the function:
  11. apply(diffs, 2, mean)
  12. B1 B2 B3 B4 B5
  13. -0.002339578 -0.004261728 0.001659273 -0.001756302 0.006486497
  14. B6 B7 B8 B9 B10
  15. 0.009839113 0.002702927 -0.004702173 -0.004021621 -0.003606409

The adjustments are then made by adding the means to the probability values for the adjective, and then renormalizing to get back into probability space.
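Putting the pieces together, the whole model can be sketched in a few lines of Python (toy 4-category vectors, invented for illustration; flooring negative values at 0 before renormalizing is my assumption, since adding differences can in principle push an entry below zero):

```python
def renormalize(v):
    s = sum(v)
    return [x / s for x in v]

# Toy (bigram, adjective) probability pairs for one adverb:
pairs = [
    ([0.10, 0.20, 0.30, 0.40], [0.20, 0.25, 0.25, 0.30]),
    ([0.05, 0.15, 0.35, 0.45], [0.15, 0.25, 0.30, 0.30]),
]

# Mean per-category difference (bigram minus adjective), the maximum
# likelihood estimate of the adverb's additive adjustment:
n = len(pairs)
mean_diffs = [sum(b[i] - a[i] for b, a in pairs) / n for i in range(4)]

def apply_adverb(adj_pr, diffs):
    # Add the mean differences, floor at 0 (assumption, see above),
    # then renormalize back into probability space.
    adjusted = [max(p + d, 0.0) for p, d in zip(adj_pr, diffs)]
    return renormalize(adjusted)

pred = apply_adverb([0.25, 0.25, 0.25, 0.25], mean_diffs)
```

Here the toy adverb shifts mass from the low categories to the high ones, so composing it with a flat adjective vector yields a positively skewed prediction.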

To see these predictions in plots, we can again use bigramCompositionPlot, here with prediction.func given by Differences. You also need to supply the keyword argument prediction.func.arg=aa so that Differences gets that table of values as one of its arguments.

  1. bigramCompositionPlot(bi, 'really', 'good', prediction.func=Differences, prediction.func.arg=aa)
Figure fig:really-good-add
Predicted value for really good given good and our additive, mean-based model for really.
figures/composition/really-good-add.png

Exercise ex:additive Use bigramCompositionPlot to try to home in on the strengths and weaknesses of this additive hypothesis about how modifiers work. Are there inherent limitations to this approach that we should be aware of?

Assessment

To assess the above functions, I calculated all the predictions for the bigrams in our aa data.frame and used the KL-divergence as a measure of how close we had come to reconstructing the observed bigram distribution. Here are the (surprising) results; lower (smaller divergence from truth) is better. The symmetric intersective analysis is in the lead!
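For reference, the KL-divergence used here is straightforward to compute; a Python sketch (smoothing zero predicted probabilities with a small constant is my assumption, not necessarily what the course code does):

```python
import math

# KL divergence D(p || q) from observed distribution p to predicted
# distribution q: lower means the prediction is closer to the truth.
def kl_divergence(p, q, eps=1e-12):
    q = [max(x, eps) for x in q]  # guard against zero predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

observed  = [0.10, 0.20, 0.30, 0.40]  # toy bigram distribution
predicted = [0.15, 0.20, 0.30, 0.35]  # toy model prediction

score = kl_divergence(observed, predicted)
perfect = kl_divergence(observed, observed)  # 0: perfect reconstruction
```

Because KL-divergence is asymmetric, the direction matters: we measure divergence from the observed bigram distribution to each model's prediction.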

Figure fig:results
Assessing the composition methods using KL-divergence. All the differences are pairwise significant at the 0.05 level according to wilcox.test, which implements the Wilcoxon signed-rank test (an improvement on the t-test that does not presuppose normally distributed data). The full plot is on the left, and an easier-to-read detail is on the right.
figures/composition/assess-full.png figures/composition/assess.png
Figure fig:socher
Table of results from Socher et al. 2012, who use the same data set as we do. The third model is our disjunctive one, and the fourth is our intersective one. I am not sure why their numbers are so notably better than ours above.
figures/composition/socher.png

Exercise ex:mult Another natural asymmetric analysis is a multiplicative version of the additive one above. Here, we would multiply the bigram and adjective probabilities to get a vector of adjustments X and then use a ratio X/A, where A is the vector of adjective probabilities, to make adjustments during composition. Implement this using FunctorStyleModel in composition.R, on the model of Differences, and see how it does.
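For orientation, here is one hedged Python reading of the exercise: average per-category bigram/adjective ratios over an adverb's attested combinations, then compose by multiplying a new adjective vector by those mean ratios and renormalizing. This is a sketch of the idea with invented numbers, not the course's FunctorStyleModel:

```python
def renormalize(v):
    s = sum(v)
    return [x / s for x in v]

# Toy (bigram, adjective) probability pairs for one adverb:
pairs = [
    ([0.10, 0.20, 0.30, 0.40], [0.20, 0.25, 0.25, 0.30]),
    ([0.05, 0.15, 0.35, 0.45], [0.15, 0.25, 0.30, 0.30]),
]

# Mean per-category ratio of bigram to adjective probability
# (one way to read the exercise's X/A adjustment vector):
n = len(pairs)
mean_ratios = [sum(b[i] / a[i] for b, a in pairs) / n for i in range(4)]

def apply_adverb(adj_pr, ratios):
    # Multiplicative composition: scale each category, renormalize.
    return renormalize([p * r for p, r in zip(adj_pr, ratios)])

pred = apply_adverb([0.25, 0.25, 0.25, 0.25], mean_ratios)
```

Note that, unlike mean differences, mean ratios can never drive a probability below zero, which is one reason a multiplicative functor might be attractive.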