- Overview
- The data file
- The code
- Extracting subframes
- Basic word-level data
- Expected categories
- Methods for generating scales
- Assessment
- Bringing in context

In sentiment analysis, one can ignore the pos/neg distinction for only so long ... This section uses words gathered from a large online collection of informal texts to try to build lexical scales. We start by figuring out how best to study word distributions, then look at a number of methods (expected categories, logistic regression, and hierarchical logistic regression) for assessing and summarizing those distributions and, in turn, for creating scales from them.

**If you haven't already, get set up with the course data and code.**

The file ratings-advadj.csv contains data gathered from a wide variety of websites (Amazon.com, OpenTable.com, Goodreads.com, IMDB.com) at which users can write reviews and attach star ratings to those reviews. I think the best way to get a feel for the file is to load it into R and check out its first few lines:

- d = read.csv('ratings-advadj.csv')
- head(d)
- Word Category Rating Type Modifier Count Total ModifierType
- 1 abhorrent -0.5 1 Books absolutely 2 87 U
- 2 abhorrent -0.5 1 Books NONE 3 36108 NONE
- 3 abhorrent -0.5 1 Books not 1 446 LinearDecrease
- 4 abhorrent -0.5 1 Books psychologically 1 3 NONE
- 5 abhorrent -0.5 1 Apparel NONE 1 1437 NONE
- 6 abhorrent -0.5 1 DVD extremely 1 524 U

Here's a rundown on the column values:

- The Word values constitute the whole vocabulary. The vocabulary for this file is fairly small (366 words). It consists of a core set of adjectives from WordNet. I kept the vocabulary small so that I could include a lot of metadata without having the file size get too large (though it is still more than 30MB and thus might take a little while to load). You can see the full vocabulary with
- levels(d$Word)

- The Rating values are raw star ratings that users provided. The file is mixed for this value. A large chunk of the data comes from sites where the ratings go from 1-5 stars. The data derived from IMDB involve star ratings that go from 1-10 stars.
- Because the Rating values are mixed, I've included the Category values. These rescale the ratings so that they run from -0.5 to 0.5, centered at 0.0. The values for the 1-5 and 1-10 star data are still different, but we can sensibly include them in the same models if we assume that they are fundamentally continuous values (even if users can only pick integers).
- The Type values are very high-level topic values:
- levels(d$Type)
- [1] "Action" "Adventure" "Animation" "Apparel"
- [5] "Automotive" "Beauty" "Books" "Comedy"
- [9] "Computers" "Crime" "Documentary" "Drama"
- [13] "DVD" "Electronics" "Home" "Horror"
- [17] "Hotels" "Instruments" "Magazines" "Movies"
- [21] "Music" "Office Products" "Photography" "Restaurants"
- [25] "Software" "Sports" "Technology" "Tools"
- [29] "Toys" "Video" "Video Games"

The IMDB data can be separated from the five-star data using the Type values:

- ImdbTypes = c('Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Drama', 'Horror', 'Documentary')
- imdb = subset(d, Type %in% ImdbTypes)
- unique(imdb$Category)
- fivestar = subset(d, !Type %in% ImdbTypes)
- unique(fivestar$Category)
- [1] -0.50 -0.25 0.00 0.25 0.50

- The values for Modifier are adverbials that appeared immediately before Word in the underlying texts. NONE indicates that there was no detectable adverbial. Many of these modifiers are single words, but some are phrasal. There are a lot of modifiers: nlevels(d$Modifier) == 863.
- ModifierType is used to group the modifiers using the statistical methods defined below. I'll shortly review the methods used to create these values. For now, you can think of them as arbitrary labels that cluster the Modifier values into one of 9 groups:
- levels(d$ModifierType)
- [1] "7" "J" "LinearDecrease" "LinearIncrease"
- [5] "NONE" "Reverse7" "TurnedJ" "TurnedU"
- [9] "U"

- The Count and Total values are the crucial data. They are conceptually like the corresponding values in the Experience Project data. Let's look at the top of the file again:
- head(d)
- Word Category Rating Type Modifier Count Total ModifierType
- 1 abhorrent -0.5 1 Books absolutely 2 87 U
- 2 abhorrent -0.5 1 Books NONE 3 36108 NONE
- 3 abhorrent -0.5 1 Books not 1 446 LinearDecrease
- 4 abhorrent -0.5 1 Books psychologically 1 3 NONE
- 5 abhorrent -0.5 1 Apparel NONE 1 1437 NONE
- 6 abhorrent -0.5 1 DVD extremely 1 524 U

This concludes our basic tour of the file. Now on to the analyses!

The file ratings.R provides a number of functions for working with the above kind of tabular data. To make these functions available, enter

- source('ratings.R')

The structure of the code is strongly reminiscent of the code for the Experience Project data. There are functions for extracting subframes based on various pieces of data, and there are functions for visualizing the relationships between the values. Our focus is again on the Word values, but we will build up to analyses that consider these words in their context, which is here a mix of high-level features and immediate morphosyntactic features.

The function ratingFullFrame is exactly analogous to epFullFrame from ep.R. It extracts subframes based on various supplied parameters. The required arguments are the data.frame and a word or list of words. The other values are optional. Here is a call with all of the default parameters so that you can get a feel for the options:

- ## An equivalent call would be ratingFullFrame(d, 'horrid').
- horrid = ratingFullFrame(d, 'horrid', types=NULL, modifiers=NULL, modifier.types=NULL, ratingmax=0)
- nrow(horrid)
- [1] 854
- head(horrid)
- Word Category Rating Type Modifier Count Total ModifierType Category2
- 167228 horrid -0.5 1 Books absolutely 2 23 U 0.25
- 167229 horrid -0.5 1 Books chillingly 1 2 LinearIncrease 0.25
- 167230 horrid -0.5 1 Books completely 1 91 LinearDecrease 0.25
- 167231 horrid -0.5 1 Books just 4 389 TurnedJ 0.25
- 167232 horrid -0.5 1 Books most 6 492 LinearIncrease 0.25
- 167233 horrid -0.5 1 Books NONE 125 50838 NONE 0.25

If ratingmax=5, then the IMDB data are left out. If ratingmax=10, then only the IMDB data are used. Both subsets are substantial, so both can lead to solid results. I often divide the data in this way in order to create prettier pictures.

Arguments that are left with their default values are treated as unspecified, so the data.frame is not restricted based on them. Where an argument is specified, the data.frame is restricted to the matching rows. Here's an example in which we restrict to situations in which horrid was modified by absolutely:

- horrid = ratingFullFrame(d, 'horrid', modifiers='absolutely')
- nrow(horrid)
- [1] 72
- head(horrid)
- Word Category Rating Type Modifier Count Total ModifierType Category2
- 167228 horrid -0.5 1 Books absolutely 2 23 U 0.25
- 167240 horrid -0.5 1 DVD absolutely 11 305 U 0.25
- 167271 horrid -0.5 1 Electronics absolutely 2 62 U 0.25
- 167282 horrid -0.5 1 Music absolutely 9 292 U 0.25
- 167309 horrid -0.5 1 Software absolutely 1 17 U 0.25
- 167318 horrid -0.5 1 Video absolutely 5 55 U 0.25

The Modifier value NONE is special. It groups instances in which Word had no left-adjacent modifiers. Thus, the following gives the purest picture we can get of horrid:

- horrid = ratingFullFrame(d, 'horrid', modifiers='NONE')

The function ratingCollapsedFrame takes the same required and optional arguments, but it collapses everything down to at most 13 rows (5 if ratingmax=5, 10 if ratingmax=10), and it has an additional set of optional arguments for adding other values that are useful for modeling (see below).

- source('ratings.R')
- horrid = ratingCollapsedFrame(d, 'horrid', types=NULL, modifiers=NULL, modifier.types=NULL, ratingmax=0, freqs=FALSE, probs=FALSE)
- horrid
- Word Category Count Total
- 1 horrid -0.50000000 2610 985324
- 2 horrid -0.38888889 361 189710
- 3 horrid -0.27777778 297 226481
- 4 horrid -0.25000000 560 584296
- 5 horrid -0.16666667 225 240491
- 6 horrid -0.05555556 238 331450
- 7 horrid 0.00000000 504 1103218
- 8 horrid 0.05555556 208 456877
- 9 horrid 0.16666667 249 680880
- 10 horrid 0.25000000 633 2552759
- 11 horrid 0.27777778 301 827158
- 12 horrid 0.38888889 206 693679
- 13 horrid 0.50000000 1295 6607404
- horrid = ratingCollapsedFrame(d, 'horrid', types=NULL, modifiers='absolutely', modifier.types=NULL, ratingmax=0, freqs=FALSE, probs=FALSE)
- horrid
- Word Category Count Total
- 1 absolutely horrid -0.50000000 75 1762
- 2 absolutely horrid -0.38888889 13 343
- 3 absolutely horrid -0.27777778 8 323
- 4 absolutely horrid -0.25000000 33 674
- 5 absolutely horrid -0.16666667 5 212
- 6 absolutely horrid -0.05555556 8 259
- 7 absolutely horrid 0.00000000 17 1164
- 8 absolutely horrid 0.05555556 8 734
- 9 absolutely horrid 0.16666667 8 1248
- 10 absolutely horrid 0.25000000 15 3913
- 11 absolutely horrid 0.27777778 5 1228
- 12 absolutely horrid 0.38888889 3 1363
- 13 absolutely horrid 0.50000000 33 22663

Recall that the Experience Project data were heavily biased towards the sympathetic categories. A similar issue arises with star-ratings data: essentially all sites on the Internet that collect user-supplied reviews of this form are heavily biased towards positivity. Products that people like are purchased more and hence reviewed more. Products that people dislike are reviewed negatively, which means that fewer people buy them, which reduces the size of the reviewing pool. There is some variation by Type value, but the overall picture is one of relentless positivity. The following function plots the Total values by Category for a random sample of Type values, which makes the skew easy to see:

- TypeTotals = function(d) {
- sampsize = 18
- par(mfrow=c(3,6), mar=c(1,1,3,1), oma=c(4,4,0,0))
- types = levels(d$Type)
- types = sample(types, sampsize, replace = FALSE, prob = NULL)
- for (i in 1:sampsize) {
- typ = subset(d, Type==types[i])
- totals = as.numeric(xtabs(Total ~ Category, data=typ))
- barplot(totals, main=types[i], axes=FALSE)
- }
- mtext('Category', side=1, outer=TRUE, cex=1.5, line=1)
- mtext('Total', side=2, outer=TRUE, cex=1.5, line=1)
- }

Thus, if we relied on pure Count values, everything would look positive.

As before, we'll steer around this by using relative frequencies, which we obtain by dividing Count values by corresponding Total values. This can be done directly with ratingCollapsedFrame using the freqs=TRUE flag.

We can also normalize these frequencies as we did before using (Count/Total) / sum(Count/Total), which is an application of Bayes' Rule to the initial probabilities but with the prior over categories flattened out (removed) so that we don't reintroduce the underlying Rating bias. The probs=TRUE flag adds these values to collapsed frames:
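As a toy illustration of the two derived quantities, with made-up counts and totals (hypothetical numbers, just to show the arithmetic):

```r
## Toy version of the Freq and Pr calculations (hypothetical counts).
count = c(30, 10, 5)
total = c(1000, 1000, 2500)
freq = count / total       # relative frequencies: 0.030 0.010 0.002
pr = freq / sum(freq)      # renormalized so the values sum to 1
round(pr, 3)               # 0.714 0.238 0.048
```

Note that Pr ignores how common the categories are overall; it compares only the word's relative frequencies across categories.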

- horrid = ratingCollapsedFrame(d, 'horrid', freqs=TRUE, probs=TRUE)
- horrid
- Word Category Count Total Freq Pr
- 1 horrid -0.50000000 2610 985324 0.0026488749 0.24395953
- 2 horrid -0.38888889 361 189710 0.0019029044 0.17525617
- 3 horrid -0.27777778 297 226481 0.0013113683 0.12077611
- 4 horrid -0.25000000 560 584296 0.0009584183 0.08826966
- 5 horrid -0.16666667 225 240491 0.0009355859 0.08616681
- 6 horrid -0.05555556 238 331450 0.0007180570 0.06613255
- 7 horrid 0.00000000 504 1103218 0.0004568453 0.04207514
- 8 horrid 0.05555556 208 456877 0.0004552648 0.04192957
- 9 horrid 0.16666667 249 680880 0.0003657032 0.03368101
- 10 horrid 0.25000000 633 2552759 0.0002479670 0.02283759
- 11 horrid 0.27777778 301 827158 0.0003638966 0.03351463
- 12 horrid 0.38888889 206 693679 0.0002969673 0.02735048
- 13 horrid 0.50000000 1295 6607404 0.0001959923 0.01805075

To plot the Freq or Pr values, use ratingPlot, which has the same argument structure as ratingCollapsedFrame, plus optional arguments for setting the limits on the y-axis, the color of the plots, and for depicting model fits (which we will return to below):

- par(mfrow=c(1,2))
- ratingPlot(d, 'horrid', probs=FALSE)
- ratingPlot(d, 'horrid', probs=TRUE)

The trends are clear, but the plots are a little bumpy due to the way we have combined five-star and ten-star reviews. You can clear this up with the ratingmax parameter if you like:

- par(mfrow=c(1,2))
- ratingPlot(d, 'horrid', ratingmax=5, probs=TRUE)
- ratingPlot(d, 'horrid', ratingmax=10, probs=TRUE)

**Exercise** ex:words
Play around with ratingPlot, looking systematically at classes of words that you expect to be related, or opposed, when it comes to the polarity scale we are dealing with.
This will give you a feel for the data, and also, for better or worse, give you a sense for how small the vocabulary is. It's also a good idea to use the corresponding ratingCollapsedFrame function call to see what the raw data are like.

**Exercise** ex:shapes
It's easy to find words that have strong biases for one end of the scale or the other (pos/neg). Can you find any that are biased towards the middle, or towards both edges together? What are such words like?

**Exercise** ex:neg
Use ratingPlot to study the effects of negation. As a first step, I recommend using modifiers='NONE' to see what the basic distribution of the word is, and then adding
modifiers=c("not", "n't", "never"). An example (the results are given in figure fig:good-neg):

- par(mfrow=c(1,2))
- ratingPlot(d, 'good', types=NULL, modifiers='NONE', probs=TRUE)
- ratingPlot(d, 'good', types=NULL, modifiers=c("not", "n't", "never"), probs=TRUE)

**Exercise** ex:mods
Broaden the investigation of negation begun in exercise ex:neg to include other modifiers — perhaps those that you expect to behave differently from negation.

The expected category calculation is a weighted average of the Category values, using the Pr values as the weights. It's true to its name in the sense that it provides the best-guess value if someone gave you the word and asked you to select an appropriate Category for it.

- sum(horrid$Category * horrid$Pr)
- [1] -0.2211628

The function ExpectedCategory does this automatically:

- ExpectedCategory(horrid)
- [1] -0.2211628

You can add expected categories to ratingPlot outputs directly:

- ratingPlot(d, 'horrid', probs=TRUE, ec=TRUE)

**Exercise** ex:er
Return to the words you plotted earlier, but now check out their expected categories by adding them with the ec=TRUE flag. Do these values suggest a method for building pos/neg sentiment lexicons?

Expected categories are easy to calculate and quite intuitive, but it is hard to know how confident we can be in them, because they are insensitive to the amount and kind of data that went into them. Suppose the ECs for words v and w are identical, but we have 500 tokens of v and just 10 tokens of w. This suggests that we can have a high degree of confidence in our EC for v, but not for w. However, EC values don't encode this uncertainty, nor is there an obvious way to capture it.
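To see the problem concretely, compare the standard error of an estimated usage proportion at two token counts (the 0.3 rate here is just an illustrative number):

```r
## Standard error of a proportion p estimated from n tokens: sqrt(p * (1 - p) / n).
p = 0.3
se.500 = sqrt(p * (1 - p) / 500)   # about 0.020
se.10 = sqrt(p * (1 - p) / 10)     # about 0.145
```

Estimates built from 10 tokens are roughly seven times noisier than estimates built from 500, but nothing in the EC itself records that difference.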

Logistic regression provides a useful way to do the work of ECs but with the added benefits of having a model and associated test statistics and measures of confidence. To start, we can fit a simple model that uses Category values to predict word usage. The intuition here is just the one that we have been working with so far: the star-ratings are correlated with the usage of some words. For a word like horrid, the correlation is negative: usage drops as the ratings get higher. For a word like amazing, the correlation is positive.

With our logistic regression models, we will essentially fit lines through our Freq data points, just as one would with a linear regression involving one predictor. However, the logistic regression model fits these values in log-odds space and uses the inverse logit function (plogis in R) to ensure that all the predicted values lie in [0,1], i.e., that they are all true probability values. Unfortunately, there is not enough time to go into much more detail about the nature of this kind of modeling. Instead, let's simply fit a model and try to build up intuitions about what it does and says:

- fit.horrid = glm(cbind(horrid$Count, horrid$Total-horrid$Count) ~ Category, family=quasibinomial, data=horrid)

Here, we use R's glm (generalized linear model) function to predict log-odds values based on Category. The expression cbind(horrid$Count, horrid$Total-horrid$Count) is used internally by glm to derive the log-odds distribution. Category is our usual vector of Category values. The family=quasibinomial specification invokes a binomial family of the sort that characterizes logistic regression, but it should give us more conservative p values for our kind of count data than binomial would.

Let's begin by inspecting the coefficients for this fit:

- fit.horrid$coef
- (Intercept) Category
- -7.376322 -2.688660

The negative sign on the coefficient for Category squares well with the fact that this is a negative word. If we fit the same model with a positive word, the Category coefficient flips its sign:

- amazing = ratingCollapsedFrame(d, 'amazing')
- fit.amazing = glm(cbind(amazing$Count, amazing$Total-amazing$Count) ~ Category,
- family=quasibinomial, data=amazing)
- fit.amazing$coef
- (Intercept) Category
- -4.609329 2.321019

The Intercept correlates with overall corpus frequency. We will mostly ignore it.
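We can read predicted frequencies off the fitted model by hand: the predicted value at Category value c is plogis(intercept + slope * c). The helper pred below is my own illustration, using the horrid coefficients reported above:

```r
## Predicted usage frequency for horrid at a given Category value,
## using the coefficients reported above (pred is just for illustration).
intercept = -7.376322
slope = -2.688660
pred = function(cat) plogis(intercept + slope * cat)
pred(-0.5)   # about 0.0024, close to the empirical Freq at the low end
pred(0.5)    # about 0.00016: predicted usage falls as the ratings rise
```

Because plogis is the inverse logit, these predictions are guaranteed to be valid probabilities, no matter what the coefficients are.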

The models are more easily understood when juxtaposed with the empirical estimates on which they are based. ratingPlot makes this easy: it takes an optional argument models that can have one or more model functions as its value. Due to an oddity of the way R stores the values, even single models have to be given inside c() (with no quotes; you're actually passing in the function itself). The results of those models will be displayed as part of the empirical plot. You can define your own model functions, but ratings.R also makes some available. GlmWordLinear is the model we've been using so far.

- par(mfrow=c(2,2))
- ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordLinear), ratingmax=5, ylim=c(0, 0.5))
- ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordLinear), ratingmax=10, ylim=c(0, 0.3))
- ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordLinear), ratingmax=5, ylim=c(0, 0.5))
- ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordLinear), ratingmax=10, ylim=c(0, 0.3))

The above models generally look good. In some cases, though, the assumption that frequencies are linearly correlated with Category (in log-odds space) is clearly false. For example, good and disappointing are only mildly positive and negative, respectively, which gives them an arch-like distribution. This suggests that we would do well to include the squared Category value as a predictor. ratings.R has such a model prespecified: GlmWordQuadratic.
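The definition of GlmWordQuadratic lives in ratings.R; presumably it just adds a squared Category term (for instance via I(Category^2) or the Category2 column) to the formula. Here is a sketch of that idea on simulated arch-shaped data, with all values invented for illustration:

```r
## Simulated arch-shaped word: usage peaks at mildly positive categories.
cats = seq(-0.5, 0.5, length.out=10)
total = rep(1000000, 10)
count = round(total * plogis(-6 + 1.5 * cats - 8 * cats^2))
toy = data.frame(Category=cats, Count=count, Total=total)
## A quadratic logistic model along the lines of GlmWordQuadratic:
fit = glm(cbind(Count, Total - Count) ~ Category + I(Category^2),
          family=quasibinomial, data=toy)
coef(fit)   # the I(Category^2) coefficient comes out negative: an arch
```

A negative quadratic coefficient captures the mid-scale peak that a purely linear model misses.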

- par(mfrow=c(2,2))
- ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
- ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))
- ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
- ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))

**Exercise** ex:logit
Once again, return to the words you plotted earlier, but now try to determine whether GlmWordLinear or GlmWordQuadratic provide good model fits for them. As you do this, consider which values you might be able to use as sentiment scores.

We can use the EC and logistic regression values to order all the lexical items. First, let's create a separate table containing just the EC value, Category coefficient, and Category p value for each word in the vocabulary. To do this efficiently, I first define an auxiliary function that, given a word's frame, gets these values for us:

- ## This subfunction calculates assessment values for a word based on a subframe.
- ## Argument: A Word's subframe directly from the main CSV source file.
- ## Value: a vector (ec, coef, coef.p, tokencount).
- WordAssess = function(pf){
- ## EC value.
- ec = ExpectedCategory(pf)
- ## Logit and the coefficient values. (We don't need the intercept values.)
- fit = GlmWordLinear(pf)
- coef = fit$coef[2]
- coef.p = summary(fit)$coef[2,4]
- ## Return a vector of assessment values, which ddply will add to its columns:
- return(c(ec, coef, coef.p, sum(pf$Count)))
- }

The function ddply from the plyr library efficiently handles grouping words (based on Word identity) and sending those subframes to WordAssess for the needed values, then adding them to a data.frame:

- ## Load the library for the ddply function.
- library(plyr)
- VocabAssess = function(df){
- ## ddply takes care of grouping into words and
- ## then applying WordAssess to those subframes.
- ## It will also display a textual progress bar:
- vals = ddply(df, c('Word'), WordAssess, .progress='text')
- ## Add intuitive column names:
- colnames(vals) = c('Word', 'EC', 'CategoryCoef', 'P', 'Tokencount')
- return(vals)
- }
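If you prefer not to rely on plyr, the same split-apply-combine pattern can be written in base R. This toy example (invented values) shows the shape of the computation that ddply performs:

```r
## What ddply does conceptually: split by Word, apply a function, recombine.
toy = data.frame(Word=c('fine', 'fine', 'grim'), Count=c(1, 2, 5))
pieces = lapply(split(toy, toy$Word),
                function(pf) data.frame(Word=pf$Word[1], Tokencount=sum(pf$Count)))
vals = do.call(rbind, pieces)
vals$Tokencount   # 3 5
```

ddply adds conveniences on top of this (progress bars, flexible grouping specifications), but the underlying idea is the same.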

It will take your computer a while to generate the values. For safety, save them to a file so that you can reuse them later:

- ratings.assess = VocabAssess(d)
- write.csv(ratings.assess, 'ratings-words-assess.csv', row.names=FALSE)

The data.frame ratings.assess has the potential to deliver us a pos/neg sentiment lexicon with continuous values. Here is a flexible function for working with it:

- PosnegLexicon = function(ratingslex, useEC=TRUE, threshold=1) {
- lex = subset(ratingslex, P <= threshold)
- if (useEC) {
- lex = lex[order(lex$EC), ]
- } else {
- lex = lex[order(lex$CategoryCoef), ]
- }
- return(lex)
- }
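Since PosnegLexicon just filters on P and sorts, its behavior can be checked on a tiny hand-built frame (all values invented):

```r
## PosnegLexicon as defined above, applied to a toy assessment frame.
PosnegLexicon = function(ratingslex, useEC=TRUE, threshold=1) {
  lex = subset(ratingslex, P <= threshold)
  if (useEC) {
    lex = lex[order(lex$EC), ]
  } else {
    lex = lex[order(lex$CategoryCoef), ]
  }
  return(lex)
}
toy = data.frame(Word=c('awful', 'great', 'fine'),
                 EC=c(-0.3, 0.4, 0.1),
                 CategoryCoef=c(-2.1, 2.5, 0.4),
                 P=c(1e-10, 1e-12, 0.2))
as.character(PosnegLexicon(toy)$Word)                  # awful fine great
as.character(PosnegLexicon(toy, threshold=0.01)$Word)  # awful great
```

The threshold argument lets you require a small Category p value before a word enters the lexicon at all.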

**Exercise** ex:lexuse
Use PosnegLexicon to study the lexicons we can obtain with the above methods. What are their strengths and weaknesses?

**Exercise** ex:pvals
How many of the words in our lexicon have significant Category coefficients? Which words fail to achieve this level? Why?

**Exercise** ex:corr
How do the EC and CategoryCoef values relate to each other? Are they highly correlated, or not? What kinds of words end up with very different EC and CategoryCoef values?

**Exercise** ex:quad
The method for generating lexicons defined above does not use the model GlmWordQuadratic, which adds a squared Category term as a predictor. How might we effectively use this model? Should it replace GlmWordLinear, or should we use them both?

**Exercise** ex:wordcmp
The above lexicon-generation method might make statisticians unhappy because, strictly speaking, we can't compare
coefficients from different fitted models and draw conclusions about the strength of the effects they capture. However,
a simple extension of our basic approach is pretty respectable. The basic strategy is to make pairwise comparisons in
the model. To do this, create two collapsed frames pf1 and pf2 using ratingCollapsedFrame, bind them
together with pf = rbind(pf1, pf2), and then extend the frame with pf$Stronger = pf$Word == as.character(pf1$Word[1]).
Now use Stronger strategically in your model. (If you decide to pursue this question, perhaps write to me — I have
some annotated data that can be used for assessment.)

Having generated a lexicon, one naturally wonders how good it is. Luckily, there are existing pos/neg lexicons that we can compare against. Here, I use the very rich MPQA lexicon, which classifies words along two dimensions: positive/neutral/negative and weaksubj/strongsubj. This file is included in today's code distribution:

- mpqa = read.csv('mpqa-lexicon.csv')
- head(mpqa)
- Word POS Polarity Subjectivity
- 1 icy a -1 1
- 2 hurt v -1 1
- 3 reprovingly v -1 2
- 4 easier r 1 1
- 5 influential a 1 1
- 6 insolently a -1 2

I've mapped the MPQA string Polarity values into numbers: -1 is negative, 0 is neutral, and 1 is positive. (I'll ignore the Subjectivity values, but I also mapped them to numbers: 1 is weaksubj and 2 is strongsubj.)

The MPQA's vocabulary is very large, and it includes words of different parts of speech. Our lexicon of ratings data contains only adjectives, so I'll restrict the frame to just the adjectival data:

- mpqa.adj = subset(mpqa, POS=='a')

We're now in a position to compare our derived values against the MPQA. The first step is to add an MPQA column to our lexicon frame ratings.assess. I again call on ddply:

- ## Function for adding an MPQA value to a word's frame:
- AddMpqa = function(pf) {
- word = as.character(pf$Word)
- if (word %in% mpqa.adj$Word) {
- val = as.numeric(subset(mpqa.adj, Word==word)$Polarity)
- }
- else {
- val = NA
- }
- pf$MPQA = val
- return(pf)
- }
- ## Now call ddply to extend the frame with this new column:
- ratings.assess = ddply(ratings.assess, .variables=colnames(ratings.assess), .fun=AddMpqa, .progress='text')

The final step is to use our values to make polarity predictions. There are lots of ways we could do this. My initial proposal is to assume that words for which we don't have miniscule p-values are words that are neutral (Category doesn't predict their usage), and then to use the sign of the Category coefficient from there:

- ## Add the predictions:
- ratings.assess$Prediction = ifelse(ratings.assess$P > 0.00001, 0,
- ifelse(ratings.assess$CategoryCoef > 0, 1,
- -1))
- ## Cross-tabulate the results for assessment:
- x = xtabs(~ MPQA + Prediction, data=ratings.assess)
- x
- Prediction
- MPQA -1 0 1
- -1 108 39 26
- 0 8 2 5
- 1 56 41 81

The confusion matrix we just created has the gold-standard data going row-wise and our predictions going column-wise. The sum of the diagonal values divided by the total is our accuracy:

- sum(diag(x)) / sum(x)
- [1] 0.5218579

Accuracy is not the best measure in this case because the gold-standard data are imbalanced, with very few neutral words:

- apply(x, 1, sum)
- -1 0 1
- 173 15 178

It is better to balance precision and recall when thinking about how we did:

- ## Precision divides the true positives for category i by the sum of all the guesses we made for i (column sums):
- Precision = function(x, i) { x[i,i] / sum(x[, i]) }
- Precision(x, 1)
- [1] 0.627907
- Precision(x, 2)
- [1] 0.02439024
- Precision(x, 3)
- [1] 0.7232143
- ## Recall divides the true positives for category i by the sum of all the gold-standard cases of i (row sums):
- Recall = function(x, i) { x[i,i] / sum(x[i, ]) }
- Recall(x, 1)
- [1] 0.6242775
- Recall(x, 2)
- [1] 0.1333333
- Recall(x, 3)
- [1] 0.4550562

Clearly, the neutral category is problematic for us. This is because our model predicts this category much more often than it is attested in MPQA. This seems like an inherent drawback to the assessment method: we just have different standards for neutrality at work. (Perhaps we would do better to assess only against the strongsubj cases.) However, the pos/neg confusions should be addressable. Let's look at a random sample of the gold-standard negative words that we mispredicted:

- neg.fail = subset(ratings.assess, MPQA==-1 & MPQA != Prediction)
- neg.fail[sample(nrow(neg.fail), 20), ]
- Word EC CategoryCoef P Tokencount Prediction MPQA
- 250 miserable -0.08575085 -1.62904156 1.047323e-11 9497 0 -1
- 276 poor -0.15553452 -2.17494454 0.000000e+00 141630 0 -1
- 26 angry -0.03425602 -0.60356329 2.301931e-12 39230 0 -1
- 43 audacious 0.15390341 0.22513033 8.629755e-01 1834 1 -1
- 247 melancholy 0.17546686 1.48608678 1.201083e-04 13232 1 -1
- 105 depressing 0.01700738 -0.91394117 1.480040e-10 31365 0 -1
- 337 tragic 0.20913108 0.39607406 9.764106e-03 35579 1 -1
- 303 scathing 0.14579717 0.09534574 9.149460e-01 2641 1 -1
- 15 aggressive 0.12895422 0.42992941 2.104427e-01 13515 1 -1
- 136 empty -0.03145483 -1.18175665 3.050219e-08 36553 0 -1
- 47 bad -0.12220433 -1.80590776 0.000000e+00 710896 0 -1
- 129 dull -0.12414360 -2.08129609 2.222769e-104 49574 0 -1
- 133 edgy 0.03534387 0.08147935 7.650492e-01 10522 1 -1
- 308 sharp 0.16568395 0.45674525 1.601153e-03 42703 1 -1
- 352 unpleasant -0.06834313 -1.66741984 2.343287e-07 9491 0 -1
- 301 sad 0.11151508 -0.34445251 2.198336e-10 122005 0 -1
- 361 weak -0.04389323 -1.35969281 2.694039e-98 81922 0 -1
- 98 dead 0.04158026 -0.64421607 2.909988e-34 189734 0 -1
- 67 brutal 0.16066174 0.43316829 1.356115e-03 36470 1 -1
- 257 nervous 0.22557651 0.31895246 3.455089e-01 12391 1 -1

If one samples around with ratingPlot, it quickly becomes clear that many of the failures are failures of our underlying model: words like sad are too weak to have truly significant linear relationships to the categories. They call for the quadratic model, making it all the more pressing that we figure out a way to bring those values into our lexicon generation method (exercise ex:quad; a hint: both the quadratic and the linear Category coefficients can be used as scores, and they can even be effectively compared, as long as they are normalized somewhat, since the quadratic scores are generally much smaller than the linear ones.)

**Exercise** ex:assess-ec
Add a column ECPrediction to the ratings.assess frame and fill it with EC values,
then re-run the assessment. How do the results compare to those obtained with the CategoryCoef values?

**Exercise** ex:subj
The MPQA data.frame also supplies subjectivity values (1 is weaksubj and 2 is strongsubj). Do these values correlate with our sentiment scores?

**Exercise** ex:op
Today's code distribution also contains a CSV file with the contents of Bing Liu's Opinion Lexicon. Assess our lexicon against this resource as well.

In the section on Experience Project data, we used first logistic regression and then hierarchical logistic regression to model the relationships between words and contextual variables. I pursue a version of the hierarchical strategy here. The above models are insensitive to all kinds of context, but we've seen already that such variables matter. It's time to bring context into our models and, in turn, into our lexicons!

ratings.R makes available a few hierarchical models. LmerWordType fits a hierarchical model that uses Category as the sole fixed-effects predictor and Type as its sole hierarchical predictor. This model delivers a fixed-effects estimate for Category (a kind of weighted average over the different subclasses given by Type), and it also provides estimates of the intercept and Category coefficient for each value of Type. Thus, we have one pooled model (fixed effects) and one model for each Type value attested in the word's subframe (28 of the 31 levels, in the case of horrid):
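To build intuition for what the hierarchical model adds, it helps to compare the two extremes it interpolates between: complete pooling (one slope for all Types, as in GlmWordLinear) and no pooling (a separate glm per Type). The lmer formula itself lives in ratings.R; this base-R sketch uses simulated data with invented slopes:

```r
## Two simulated Types whose true Category slopes differ (toy data).
cats = c(-0.5, -0.25, 0, 0.25, 0.5)
mk = function(slope) round(1000000 * plogis(-7 + slope * cats))
toy = data.frame(Category=rep(cats, 2),
                 Type=rep(c('Books', 'Music'), each=5),
                 Count=c(mk(-3), mk(-2)), Total=1000000)
## Complete pooling: one Category slope for everything.
pooled = coef(glm(cbind(Count, Total - Count) ~ Category,
                  family=quasibinomial, data=toy))[['Category']]
## No pooling: an independent slope per Type.
separate = sapply(split(toy, toy$Type), function(pf)
  coef(glm(cbind(Count, Total - Count) ~ Category,
           family=quasibinomial, data=pf))[['Category']])
pooled      # falls between the two per-Type slopes
separate    # close to the true -3 and -2
```

The per-Type estimates that coef() reports for an lmer fit fall between these extremes: they are shrunk toward the fixed-effects slope in proportion to how much data each Type supplies.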

- horrid = ratingFullFrame(d, word='horrid')
- horrid.lmer = LmerWordType(horrid)
- fixef(horrid.lmer)
- (Intercept) Category
- -7.565669 -2.644308
- coef(horrid.lmer)
- $Type
- (Intercept) Category
- Action -7.274226 -2.561226
- Adventure -7.274037 -2.667250
- Animation -7.147119 -2.682686
- Apparel -7.409348 -2.620771
- Beauty -7.270026 -2.441967
- Books -7.422696 -2.313351
- Comedy -7.331637 -2.906460
- Computers -7.778343 -2.671019
- Crime -7.273049 -2.341563
- Documentary -7.208281 -1.889977
- Drama -7.254646 -2.402365
- DVD -7.146788 -2.230678
- Electronics -7.902919 -2.277001
- Home -8.307209 -2.358733
- Horror -6.938018 -2.160266
- Hotels -8.179016 -2.892307
- Instruments -7.165105 -2.583995
- Magazines -7.531645 -2.605249
- Music -7.465000 -3.308469
- Office Products -7.456904 -2.627931
- Photography -8.287318 -3.209077
- Restaurants -9.015850 -3.897194
- Software -7.988622 -2.706323
- Sports -7.953478 -2.928057
- Technology -7.465764 -2.908992
- Tools -7.730007 -2.751685
- Video -7.153390 -2.379736
- Video Games -7.107696 -2.492795

The function ratingPlotHierarchicalCoefficients plots the fixed and hierarchical coefficients. The result is a ranking, top to bottom, but with the labels jittered randomly left to right so that nearby values don't sit on top of each other. The fixed-effects estimate is given as a big green dot:

- par(mfrow=c(1,3))
- ratingPlotHierarchicalCoefficients(d, 'horrid', LmerWordType, fixed.coef='Category', hierarchical.coef='Type')
- ratingPlotHierarchicalCoefficients(d, 'hilarious', LmerWordType, fixed.coef='Category', hierarchical.coef='Type')
- ratingPlotHierarchicalCoefficients(d, 'surprising', LmerWordType, fixed.coef='Category', hierarchical.coef='Type')

The model LmerWordModifier uses the same technique, but with Modifier as the hierarchical effect. The results provide a glimpse of the multifaceted ways in which adverbs modulate the meanings of the words they modify.

- ratingPlotHierarchicalCoefficients(d, 'disappointing', LmerWordModifier, fixed.coef='Category', hierarchical.coef='Modifier')

**Exercise** ex:modshapes
The ModifierType values are shapes that correspond intuitively to the picture we get when regressing the frequency of Modifier on Category or Category^2. These provide rough semantic classifications, and the resulting models are more constrained than those obtained directly from the modifiers. ratings.R contains a model LmerWordModifierType for exploring these values with respect to specific words. Use ratingPlotHierarchicalCoefficients to get a feel for what the shapes are like and how they might be used in semantic/sentiment analysis.

**Exercise** ex:neg-estimates
Use the above modeling techniques (any of them) to study the effects of negation. Are there generalizations we can make about how negation affects the sentiment scores we can derive from these data?

**Exercise** ex:larger
The vocabulary of ratings-advadj.csv might start to feel confining after a little while. The file imdb-unigrams.csv in the code distribution for my SALT 20 paper is compatible with the above functions. It doesn't have the contextual features, but it has a truly massive vocabulary: potts-salt20-data-and-code.zip.

**Exercise** ex:assess-lmer
As a first step towards building a context-aware lexicon, adapt the scale generation method of section 7 above to the hierarchical setting by using the fixef coefficients in place of the GLM ones, using one of the lmer models discussed above. Then assess the resulting lexicon against the MPQA as in section 8.