Adverb–adjective combinations in reviews

Overview
The data file
The code
Extracting subframes
Basic word-level data
Expected categories
Methods for generating scales
Assessment
Bringing in context

Overview

In sentiment analysis, one can ignore the pos/neg distinction for only so long ... This section uses words gathered from a large online collection of informal texts to try to build lexical scales. We start by figuring out how best to study word distributions, then look at a number of methods (expected categories, logistic regression, and hierarchical logistic regression) for assessing and summarizing those distributions and, in turn, for creating scales from them.

If you haven't already, get set up with the course data and code.

The data file

The file ratings-advadj.csv contains data gathered from a wide variety of websites (Amazon.com, OpenTable.com, Goodreads.com, IMDB.com) at which users can write reviews and attach star ratings to those reviews. I think the best way to get a feel for the file is to load it into R and check out its first few lines:

d = read.csv('ratings-advadj.csv')
head(d)
Word Category Rating Type Modifier Count Total ModifierType
1 abhorrent -0.5 1 Books absolutely 2 87 U
2 abhorrent -0.5 1 Books NONE 3 36108 NONE
3 abhorrent -0.5 1 Books not 1 446 LinearDecrease
4 abhorrent -0.5 1 Books psychologically 1 3 NONE
5 abhorrent -0.5 1 Apparel NONE 1 1437 NONE
6 abhorrent -0.5 1 DVD extremely 1 524 U

Here's a rundown on the column values:

The Word values constitute the whole vocabulary. The vocabulary for this file is fairly small (366 words). It consists of a core set of adjectives from WordNet. I kept the vocabulary small so that I could include a lot of metadata without having the file-size get too large (though it is still more than 30MB and thus might take a little while to load). You can use
levels(d$Word)
to view the whole vocabulary and nlevels(d$Word) to count it. It's too big to reproduce here, but do check it out, if only to scan it quickly to get a feel for what's in there.
The Rating values are raw star ratings that users provided. The file is mixed for this value. A large chunk of the data comes from sites where the ratings go from 1-5 stars. The data derived from IMDB involve star ratings that go from 1-10 stars.
Because the Rating values are mixed, I've included the Category values. These rescale the ratings so that they run from -0.5 to 0.5, centered at 0.0. The values for the 1-5 and 1-10 star data are still different, but we can sensibly include them in the same models if we assume that they are fundamentally continuous values (even if users can only pick integers).
The Type values are very high-level topic values:
levels(d$Type)
[1] "Action" "Adventure" "Animation" "Apparel"
[5] "Automotive" "Beauty" "Books" "Comedy"
[9] "Computers" "Crime" "Documentary" "Drama"
[13] "DVD" "Electronics" "Home" "Horror"
[17] "Hotels" "Instruments" "Magazines" "Movies"
[21] "Music" "Office Products" "Photography" "Restaurants"
[25] "Software" "Sports" "Technology" "Tools"
[29] "Toys" "Video" "Video Games"
For many of these categories, the sources are somewhat disparate. For example, Books data come from Amazon.com as well as Goodreads.com. However, the ones for movie genres are strictly from IMDB and thus allow us to extract the data that have a 10-star basis:
ImdbTypes = c('Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Drama', 'Horror', 'Documentary')
imdb = subset(d, Type %in% ImdbTypes)
unique(imdb$Category)
fivestar = subset(d, !Type %in% ImdbTypes)
unique(fivestar$Category)
[1] -0.50 -0.25 0.00 0.25 0.50
It will occasionally be useful to be able to split the data apart in this way. But the real message is that you can easily segment the data by Type. If you want to study just the way Word behaves with respect to Books, or Technology, etc., then you can grab subsets based on that value.
The values for Modifier are adverbials that appeared immediately before Word in the underlying texts. NONE indicates that there was no detectable adverbial. Many of these modifiers are single words, but some are phrasal. There are a lot of modifiers: nlevels(d$Modifier) == 863.
ModifierType is used to group the modifiers using the statistical methods defined below. I'll shortly review the methods used to create these values. For now, you can think of them as arbitrary labels that cluster the Modifier values into one of 9 groups:
levels(d$ModifierType)
[1] "7" "J" "LinearDecrease" "LinearIncrease"
[5] "NONE" "Reverse7" "TurnedJ" "TurnedU"
[9] "U"
The Count and Total values are the crucial data. They are conceptually like the corresponding values in the Experience Project data. Let's look at the top of the file again:
head(d)
Word Category Rating Type Modifier Count Total ModifierType
1 abhorrent -0.5 1 Books absolutely 2 87 U
2 abhorrent -0.5 1 Books NONE 3 36108 NONE
3 abhorrent -0.5 1 Books not 1 446 LinearDecrease
4 abhorrent -0.5 1 Books psychologically 1 3 NONE
5 abhorrent -0.5 1 Apparel NONE 1 1437 NONE
6 abhorrent -0.5 1 DVD extremely 1 524 U
In row 1, the Count value of 2 means that the corpus contains 2 tokens of abhorrent (Word column) in book reviews (Type column) modified by absolutely (Modifier). The Total value gives the sum of all such tokens, summing over all words.
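
The rescaling behind the Category column can be sketched as follows. This is an assumption inferred from the values that appear in the file (for example, -0.25 for two stars out of five and -0.38888889 for two stars out of ten), not the code that actually produced the file, and rescale_rating is a hypothetical helper name:

```r
## Hypothetical helper: map a star rating on a 1..ratingmax scale onto
## [-0.5, 0.5]. Inferred from the Category values in the file, not taken
## from the code that built it.
rescale_rating = function(rating, ratingmax) {
  (rating - 1) / (ratingmax - 1) - 0.5
}

## Five-star sites: five evenly spaced values.
fivestar.cats = rescale_rating(1:5, 5)
fivestar.cats

## Ten-star (IMDB) data: ten evenly spaced values.
tenstar.cats = rescale_rating(1:10, 10)
round(tenstar.cats, 8)
```

Under this assumption the two scales share only their endpoints, which is why frames that mix both kinds of data end up with 13 distinct Category values.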

This concludes our basic tour of the file. Now on to the analyses!

The code

The file ratings.R provides a number of functions for working with the above kind of tabular data. To make these functions available, enter

source('ratings.R')

The structure of the code is strongly reminiscent of the code for the Experience Project data. There are functions for extracting subframes based on various pieces of data, and there are functions for visualizing the relationships between the values. Our focus is again on the Word values, but we will build up to analyses that consider these words in their context, which is here a mix of high-level features and immediate morphosyntactic features.

Extracting subframes

The function ratingFullFrame is exactly analogous to epFullFrame from ep.R. It extracts subframes based on various supplied parameters. The required arguments are the data.frame and a word or list of words. The other values are optional. Here is a call with all of the default parameters so that you can get a feel for the options:

## An equivalent call would be ratingFullFrame(d, 'horrid').
horrid = ratingFullFrame(d, 'horrid', types=NULL, modifiers=NULL, modifier.types=NULL, ratingmax=0)
nrow(horrid)
[1] 854
head(horrid)
Word Category Rating Type Modifier Count Total ModifierType Category2
167228 horrid -0.5 1 Books absolutely 2 23 U 0.25
167229 horrid -0.5 1 Books chillingly 1 2 LinearIncrease 0.25
167230 horrid -0.5 1 Books completely 1 91 LinearDecrease 0.25
167231 horrid -0.5 1 Books just 4 389 TurnedJ 0.25
167232 horrid -0.5 1 Books most 6 492 LinearIncrease 0.25
167233 horrid -0.5 1 Books NONE 125 50838 NONE 0.25

If ratingmax=5, then the IMDB data are left out. If ratingmax=10, then only the IMDB data are used. Both subsets are substantial, so both can lead to solid results. I often divide the data in this way in order to create prettier pictures.

Arguments that are left with their default values are assumed unspecified, so the data.frame is not restricted based on those values. Where they are specified, the frame is restricted to rows matching the supplied values. Here's an example in which we restrict to situations in which horrid was modified by absolutely:

horrid = ratingFullFrame(d, 'horrid', modifiers='absolutely')
nrow(horrid)
[1] 72
head(horrid)
Word Category Rating Type Modifier Count Total ModifierType Category2
167228 horrid -0.5 1 Books absolutely 2 23 U 0.25
167240 horrid -0.5 1 DVD absolutely 11 305 U 0.25
167271 horrid -0.5 1 Electronics absolutely 2 62 U 0.25
167282 horrid -0.5 1 Music absolutely 9 292 U 0.25
167309 horrid -0.5 1 Software absolutely 1 17 U 0.25
167318 horrid -0.5 1 Video absolutely 5 55 U 0.25

The Modifier value NONE is special. It groups instances in which Word had no left-adjacent modifiers. Thus, the following gives the purest picture we can get of horrid:

horrid = ratingFullFrame(d, 'horrid', modifiers='NONE')

The function ratingCollapsedFrame takes the same required and optional arguments, but it collapses everything down to one row per Category value (13 rows by default, 10 if ratingmax=10, 5 if ratingmax=5), and it has an additional set of optional arguments for adding other values that are useful for modeling (see below).

source('ratings.R')
horrid = ratingCollapsedFrame(d, 'horrid', types=NULL, modifiers=NULL, modifier.types=NULL, ratingmax=0, freqs=FALSE, probs=FALSE)
horrid
Word Category Count Total
1 horrid -0.50000000 2610 985324
2 horrid -0.38888889 361 189710
3 horrid -0.27777778 297 226481
4 horrid -0.25000000 560 584296
5 horrid -0.16666667 225 240491
6 horrid -0.05555556 238 331450
7 horrid 0.00000000 504 1103218
8 horrid 0.05555556 208 456877
9 horrid 0.16666667 249 680880
10 horrid 0.25000000 633 2552759
11 horrid 0.27777778 301 827158
12 horrid 0.38888889 206 693679
13 horrid 0.50000000 1295 6607404
horrid = ratingCollapsedFrame(d, 'horrid', types=NULL, modifiers='absolutely', modifier.types=NULL, ratingmax=0, freqs=FALSE, probs=FALSE)
horrid
Word Category Count Total
1 absolutely horrid -0.50000000 75 1762
2 absolutely horrid -0.38888889 13 343
3 absolutely horrid -0.27777778 8 323
4 absolutely horrid -0.25000000 33 674
5 absolutely horrid -0.16666667 5 212
6 absolutely horrid -0.05555556 8 259
7 absolutely horrid 0.00000000 17 1164
8 absolutely horrid 0.05555556 8 734
9 absolutely horrid 0.16666667 8 1248
10 absolutely horrid 0.25000000 15 3913
11 absolutely horrid 0.27777778 5 1228
12 absolutely horrid 0.38888889 3 1363
13 absolutely horrid 0.50000000 33 22663

Basic word-level data

Recall that the Experience Project data were heavily biased towards the sympathetic categories. A similar issue arises with star-ratings data: essentially all sites on the Internet that collect user-supplied reviews of this form are heavily biased towards positivity. Products that people like are purchased more and hence reviewed more. Products that people dislike are reviewed negatively, which means that fewer people buy them, which reduces the size of the reviewing pool. There is some variation by Type value, but the overall picture is one of relentless positivity.

TypeTotals = function(d) {
  ## Number of Type values to sample; must match the 3x6 plot grid.
  sampsize = 18
  par(mfrow=c(3,6), mar=c(1,1,3,1), oma=c(4,4,0,0))
  types = levels(d$Type)
  types = sample(types, sampsize, replace = FALSE, prob = NULL)
  for (i in 1:sampsize) {
    typ = subset(d, Type==types[i])
    ## Sum the Total values within each Category for this Type.
    totals = as.numeric(xtabs(Total ~ Category, data=typ))
    barplot(totals, main=types[i], axes=FALSE)
  }
  ## Shared axis labels in the outer margins.
  mtext('Category', side=1, outer=TRUE, cex=1.5, line=1)
  mtext('Total', side=2, outer=TRUE, cex=1.5, line=1)
}

Figure fig:type-totals

Total values by Category for a random sample of 18 Type values (out of 31 in all).

Thus, if we relied on pure Count values, everything would look positive.

As before, we'll steer around this by using relative frequencies, which we obtain by dividing Count values by corresponding Total values. This can be done directly with ratingCollapsedFrame using the freqs=TRUE flag.

We can also normalize these frequencies as we did before using (Count/Total) / sum(Count/Total), which is an application of Bayes' Rule to the initial probabilities but with the prior over categories flattened out (removed) so that we don't reintroduce the underlying Rating bias. The probs=TRUE flag adds these values to collapsed frames:

horrid = ratingCollapsedFrame(d, 'horrid', freqs=TRUE, probs=TRUE)
horrid
Word Category Count Total Freq Pr
1 horrid -0.50000000 2610 985324 0.0026488749 0.24395953
2 horrid -0.38888889 361 189710 0.0019029044 0.17525617
3 horrid -0.27777778 297 226481 0.0013113683 0.12077611
4 horrid -0.25000000 560 584296 0.0009584183 0.08826966
5 horrid -0.16666667 225 240491 0.0009355859 0.08616681
6 horrid -0.05555556 238 331450 0.0007180570 0.06613255
7 horrid 0.00000000 504 1103218 0.0004568453 0.04207514
8 horrid 0.05555556 208 456877 0.0004552648 0.04192957
9 horrid 0.16666667 249 680880 0.0003657032 0.03368101
10 horrid 0.25000000 633 2552759 0.0002479670 0.02283759
11 horrid 0.27777778 301 827158 0.0003638966 0.03351463
12 horrid 0.38888889 206 693679 0.0002969673 0.02735048
13 horrid 0.50000000 1295 6607404 0.0001959923 0.01805075
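
To see what the freqs and probs flags compute, the two columns can be rebuilt by hand from the Count and Total values printed above. This is a sketch of the arithmetic only; how ratingCollapsedFrame implements it internally is not shown here:

```r
## Count and Total values copied from the collapsed horrid frame above.
count = c(2610, 361, 297, 560, 225, 238, 504, 208, 249, 633, 301, 206, 1295)
total = c(985324, 189710, 226481, 584296, 240491, 331450, 1103218,
          456877, 680880, 2552759, 827158, 693679, 6607404)

## Relative frequency of the word within each rating category.
freq = count / total

## Normalize so that the values sum to 1 (a flat prior over categories).
pr = freq / sum(freq)

round(data.frame(Freq = freq, Pr = pr), 8)
```

Because each Count is divided by its own Total before normalizing, the large Total values at the positive end of the scale no longer drag the distribution upward.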

To plot the Freq or Pr values, use ratingPlot, which has the same argument structure as ratingCollapsedFrame, plus optional arguments for setting the limits on the y-axis, the color of the plots, and for depicting model fits (which we will return to below):

par(mfrow=c(1,2))
ratingPlot(d, 'horrid', probs=FALSE)
ratingPlot(d, 'horrid', probs=TRUE)

Figure fig:horrid

The Freq and Pr values for horrid in the whole data set.

The trends are clear, but the plots are a little bumpy due to the way we have combined five-star and ten-star reviews. You can clear this up with the ratingmax parameter if you like:

par(mfrow=c(1,2))
ratingPlot(d, 'horrid', ratingmax=5, probs=TRUE)
ratingPlot(d, 'horrid', ratingmax=10, probs=TRUE)

Figure fig:horrid-ratingmax

The Pr values for horrid, split into the five-star and ten-star subsets.

Exercise ex:words Play around with the ratingPlot, looking systematically at classes of words that you expect to be related, or opposed, when it comes to the polarity scale we are dealing with. This will give you a feel for the data, and also, for better or worse, give you a sense for how small the vocabulary is. It's also a good idea to use the corresponding ratingCollapsedFrame function call to see what the raw data are like.

Exercise ex:shapes It's easy to find words that have strong biases for one end of the scale or the other (pos/neg). Can you find any that are biased towards the middle, or towards both edges together? What are such words like?

Exercise ex:neg Use ratingPlot to study the effects of negation. As a first step, I recommend using modifiers='NONE' to see what the basic distribution of the word is, and then adding modifiers=c("not", "n't", "never"). An example (the results are given in figure fig:good-neg):

par(mfrow=c(1,2))
ratingPlot(d, 'good', types=NULL, modifiers='NONE', probs=TRUE)
ratingPlot(d, 'good', types=NULL, modifiers=c("not", "n't", "never"), probs=TRUE)

Figure fig:good-neg

The Pr values for good, with no modifiers (left) and with negation (right).

Exercise ex:mods Broaden the investigation of negation begun in exercise ex:neg to include other modifiers — perhaps those that you expect to behave differently from negation.

Expected categories

The expected category calculation is just a weighted average of Pr values. It's true to its name in the sense that it provides the best-guess value if someone gave you the word and asked you to select an appropriate Category for it.

sum(horrid$Category * horrid$Pr)
[1] -0.2211628

The function ExpectedCategory does this automatically:

ExpectedCategory(horrid)
[1] -0.2211628
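
For readers without the data loaded, here is the same calculation run on the Category and Pr columns as they were printed in the collapsed frame for horrid. The function expected_category_sketch is a hypothetical stand-in, not the ExpectedCategory from ratings.R; it divides by sum(pr) as a guard in case the supplied values are not perfectly normalized:

```r
## Category and Pr columns as printed in the collapsed horrid frame.
category = c(-0.50000000, -0.38888889, -0.27777778, -0.25000000, -0.16666667,
             -0.05555556, 0.00000000, 0.05555556, 0.16666667, 0.25000000,
             0.27777778, 0.38888889, 0.50000000)
pr = c(0.24395953, 0.17525617, 0.12077611, 0.08826966, 0.08616681,
       0.06613255, 0.04207514, 0.04192957, 0.03368101, 0.02283759,
       0.03351463, 0.02735048, 0.01805075)

## Hypothetical stand-in for ExpectedCategory: a probability-weighted mean.
expected_category_sketch = function(category, pr) {
  sum(category * pr) / sum(pr)
}

ec = expected_category_sketch(category, pr)
ec
```

The result sits well below 0, the midpoint of the scale, which is what we expect for a word with horrid's distribution.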

You can add expected categories to ratingPlot outputs directly:

ratingPlot(d, 'horrid', probs=TRUE, ec=TRUE)

Figure fig:er

Expected category value for horrid.

Exercise ex:er Return to the words you plotted earlier, but now check out their expected categories by adding them with the ec=TRUE flag. Do these values suggest a method for building pos/neg sentiment lexicons?

Logistic regression

Expected categories are easy to calculate and quite intuitive, but it is hard to know how confident we can be in them, because they are insensitive to the amount and kind of data that went into them. Suppose the ECs for words v and w are both 0.1, but we have 500 tokens of v and just 10 tokens of w. This suggests that we can have a high degree of confidence in our EC for v, but not for w. However, EC values don't encode this uncertainty, nor is there an obvious way to capture it.

Logistic regression provides a useful way to do the work of ECs but with the added benefits of having a model and associated test statistics and measures of confidence. To start, we can fit a simple model that uses Category values to predict word usage. The intuition here is just the one that we have been working with so far: the star-ratings are correlated with the usage of some words. For a word like horrid, the correlation is negative: usage drops as the ratings get higher. For a word like amazing, the correlation is positive.

With our logistic regression models, we will essentially fit lines through our Freq data points, just as one would with a linear regression involving one predictor. However, the logistic regression model fits these values in log-odds space and uses the inverse logit function (plogis in R) to ensure that all the predicted values lie in [0,1], i.e., that they are all true probability values. Unfortunately, there is not enough time to go into much more detail about the nature of this kind of modeling. Instead, let's simply fit a model and try to build up intuitions about what it does and says:

fit.horrid = glm(cbind(horrid$Count, horrid$Total-horrid$Count) ~ Category, family=quasibinomial, data=horrid)

Here, we use R's glm (generalized linear model) function to predict log-odds values based on Category. The expression cbind(horrid$Count, horrid$Total-horrid$Count) is used internally by glm to derive the log-odds distribution. Category is our usual vector of Category values. The family=quasibinomial specification invokes a binomial family of the sort that characterizes logistic regression, but it should give us more conservative p values for our kind of count data than binomial would.

Let's begin by inspecting the coefficients for this fit:

fit.horrid$coef
(Intercept) Category
-7.376322 -2.688660

The negative sign on the coefficient for Category squares well with the fact that this is a negative word. If we fit the same model with a positive word, the Category coefficient flips its sign:

amazing = ratingCollapsedFrame(d, 'amazing')
fit.amazing = glm(cbind(amazing$Count, amazing$Total-amazing$Count) ~ Category,
family=quasibinomial, data=amazing)
fit.amazing$coef
(Intercept) Category
-4.609329 2.321019

The Intercept correlates with overall corpus frequency. We will mostly ignore it.
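
To make the horrid fit easier to experiment with, here is a self-contained sketch that rebuilds the collapsed frame from the values printed earlier and fits the same formula. The name horrid.toy is just a label for this hand-built frame; the code also shows how plogis turns the log-odds coefficients back into predicted probabilities:

```r
## Collapsed horrid frame (Category, Count, Total) copied from the output above.
horrid.toy = data.frame(
  Category = c(-0.50000000, -0.38888889, -0.27777778, -0.25000000, -0.16666667,
               -0.05555556, 0.00000000, 0.05555556, 0.16666667, 0.25000000,
               0.27777778, 0.38888889, 0.50000000),
  Count = c(2610, 361, 297, 560, 225, 238, 504, 208, 249, 633, 301, 206, 1295),
  Total = c(985324, 189710, 226481, 584296, 240491, 331450, 1103218,
            456877, 680880, 2552759, 827158, 693679, 6607404))

## Same model formula as in the text: successes are tokens of the word,
## failures are all the other tokens in that category.
fit = glm(cbind(Count, Total - Count) ~ Category,
          family = quasibinomial, data = horrid.toy)

## Coefficients live in log-odds space.
coefs = coef(fit)
coefs

## p value for the Category coefficient (row 2, column 4 of the table).
category.p = summary(fit)$coef[2, 4]

## plogis (the inverse logit) maps log-odds back to probabilities.
predicted = plogis(coefs[1] + coefs[2] * horrid.toy$Category)
cbind(Observed = horrid.toy$Count / horrid.toy$Total, Predicted = predicted)
```

Comparing the Observed and Predicted columns row by row is a quick numerical counterpart to overlaying the model on the empirical plot.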

The models are more easily understood when juxtaposed with the empirical estimates on which they are based. ratingPlot makes this easy: it takes an optional argument models that can have one or more models as its values. Due to an oddity of the way R stores the values, even single models have to be given inside c() (but no quotes; you're actually passing in the function!). The results of those models will be displayed as part of the empirical plot. You can define your own model functions, but ratings.R also makes some available. GlmWordLinear is the model we've been using so far.

par(mfrow=c(2,2))
ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordLinear), ratingmax=5, ylim=c(0, 0.5))
ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordLinear), ratingmax=10, ylim=c(0, 0.3))
ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordLinear), ratingmax=5, ylim=c(0, 0.5))
ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordLinear), ratingmax=10, ylim=c(0, 0.3))

Figure fig:glm

Plots for good and disappointing with simple GLM model fits included.

The above models generally look good. In some cases, though, the assumption that frequencies are linearly correlated with the Category is clearly false. For example, good and disappointing are mildly positive and negative, respectively, which gives them an arch-like distribution. This suggests that we would do well to include the squared Category value as a predictor. ratings.R has such a model prespecified: GlmWordQuadratic.

par(mfrow=c(2,2))
ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))
ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))

Figure fig:glm2

Plots for good and disappointing with quadratic GLM model fits included.
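
The difference between the two model shapes can be seen on a small invented example. The counts below are hypothetical (they are not drawn from ratings-advadj.csv) and are chosen to be arch-shaped on a five-star scale; the formulas are a guess at what a linear and a quadratic word model look like, not the actual definitions of GlmWordLinear and GlmWordQuadratic in ratings.R:

```r
## Hypothetical arch-shaped counts (NOT from ratings-advadj.csv): the word is
## most frequent in the middle of a five-star scale.
arch = data.frame(
  Category = c(-0.50, -0.25, 0.00, 0.25, 0.50),
  Count    = c(247, 522, 669, 522, 247),
  Total    = rep(100000, 5))

## Linear model: a single slope cannot capture the arch.
fit.linear = glm(cbind(Count, Total - Count) ~ Category,
                 family = quasibinomial, data = arch)

## Quadratic model: adds the squared Category value as a second predictor.
fit.quad = glm(cbind(Count, Total - Count) ~ Category + I(Category^2),
               family = quasibinomial, data = arch)

coef(fit.linear)
coef(fit.quad)
```

On symmetric data like these, the linear Category coefficient is essentially zero, while the coefficient on the squared term is clearly negative. This is the sense in which the quadratic term picks up middle-of-the-scale words that the linear model treats as uninformative.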

Exercise ex:logit Once again, return to the words you plotted earlier, but now try to determine whether GlmWordLinear or GlmWordQuadratic provides a good model fit for them. As you do this, consider which values you might be able to use as sentiment scores.

Methods for generating scales

We can use the EC and logistic regression values to order all the lexical items. First, let's create a separate table containing just the EC value, Category coefficient, and Category p value for each word in the vocabulary. To do this efficiently, I first define an auxiliary function that, given a word's frame, gets these values for us:

## This subfunction calculates assessment values for a word based on a subframe.
## Argument: A Word's subframe directly from the main CSV source file.
## Value: a vector (ec, coef, coef.p, tokencount).
WordAssess = function(pf){
  ## EC value.
  ec = ExpectedCategory(pf)
  ## Logit and the coefficient values. (We don't need the intercept values.)
  fit = GlmWordLinear(pf)
  coef = fit$coef[2]
  coef.p = summary(fit)$coef[2,4]
  ## Return a vector of assessment values, which ddply will add to its columns:
  return(c(ec, coef, coef.p, sum(pf$Count)))
}

The function ddply from the plyr library efficiently handles grouping words (based on Word identity) and sending those subframes to WordAssess for the needed values, then adding them to a data.frame:

## Load the library for the ddply function.
library(plyr)
VocabAssess = function(df){
  ## ddply takes care of grouping into words and
  ## then applying WordAssess to those subframes.
  ## It will also display a textual progress bar:
  vals = ddply(df, c('Word'), WordAssess, .progress='text')
  ## Add intuitive column names:
  colnames(vals) = c('Word', 'EC', 'CategoryCoef', 'P', 'Tokencount')
  return(vals)
}

It will take your computer a while to generate the values. For safety, save to a file so that you can use it later:

ratings.assess = VocabAssess(d)
write.csv(ratings.assess, 'ratings-words-assess.csv', row.names=FALSE)

The data.frame ratings.assess has the potential to deliver us a pos/neg sentiment lexicon with continuous values. Here is a flexible function for working with it:

PosnegLexicon = function(ratingslex, useEC=TRUE, threshold=1) {
  lex = subset(ratingslex, P <= threshold)
  if (useEC) {
    lex = lex[order(lex$EC), ]
  } else {
    lex = lex[order(lex$CategoryCoef), ]
  }
  return(lex)
}
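
Here is a small usage sketch. The frame toy.assess is a four-word stand-in for ratings.assess, with values copied from the sample of ratings.assess rows shown in the Assessment section; PosnegLexicon is repeated only so that the block runs on its own:

```r
## PosnegLexicon as defined above, repeated so this block is self-contained.
PosnegLexicon = function(ratingslex, useEC=TRUE, threshold=1) {
  lex = subset(ratingslex, P <= threshold)
  if (useEC) {
    lex = lex[order(lex$EC), ]
  } else {
    lex = lex[order(lex$CategoryCoef), ]
  }
  return(lex)
}

## A four-word stand-in for ratings.assess.
toy.assess = data.frame(
  Word = c('miserable', 'poor', 'audacious', 'tragic'),
  EC = c(-0.08575085, -0.15553452, 0.15390341, 0.20913108),
  CategoryCoef = c(-1.62904156, -2.17494454, 0.22513033, 0.39607406),
  P = c(1.047323e-11, 0, 8.629755e-01, 9.764106e-03),
  Tokencount = c(9497, 141630, 1834, 35579),
  stringsAsFactors = FALSE)

## Full lexicon, ordered from most negative to most positive EC.
by.ec = PosnegLexicon(toy.assess)
by.ec$Word

## Only words with P <= 0.01, ordered by the Category coefficient.
by.coef = PosnegLexicon(toy.assess, useEC=FALSE, threshold=0.01)
by.coef$Word
```

Lowering threshold trades coverage for reliability: audacious, whose Category coefficient is far from significant, drops out of the second lexicon.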

Exercise ex:lexuse Use PosnegLexicon to study the lexicons we can obtain with the above methods. What are their strengths and weaknesses?

Exercise ex:pvals How many of the words in our lexicon have significant Category coefficients? Which words fail to achieve this level? Why?

Exercise ex:corr How do the EC and Category values relate to each other? Are they highly correlated, or not? What kinds of words end up with very different EC and Category values?

Exercise ex:quad The method for generating lexicons defined above does not use the model GlmWordQuadratic, which has an additional predictor for the squared Category value. How might we effectively use this model? Should it replace GlmWordLinear, or should we use them both?

Exercise ex:wordcmp The above lexicon-generation method might make statisticians unhappy because, strictly speaking, we can't compare coefficients from different fitted models and draw conclusions about the strength of the effects they capture. However, a simple extension of our basic approach is pretty respectable. The basic strategy is to make pairwise comparisons in the model. To do this, create two collapsed frames pf1 and pf2 using ratingCollapsedFrame, bind them together with pf = rbind(pf1, pf2), and then extend the frame with pf$Stronger = pf$Word == as.character(pf1$Word[1]). Now use Stronger strategically in your model. (If you decide to pursue this question, perhaps write to me — I have some annotated data that can be used for assessment.)

Assessment

Having generated a lexicon, one naturally wonders how good it is. Luckily, there are existing pos/neg lexicons that we can compare against. Here, I use the very rich MPQA lexicon, which classifies words along two dimensions: positive/neutral/negative and weaksubj/strongsubj. This file is included in today's code distribution:

mpqa = read.csv('mpqa-lexicon.csv')
head(mpqa)
Word POS Polarity Subjectivity
1 icy a -1 1
2 hurt v -1 1
3 reprovingly v -1 2
4 easier r 1 1
5 influential a 1 1
6 insolently a -1 2

I've mapped the MPQA string Polarity values into numbers: -1 is negative, 0 is neutral, and 1 is positive. (I'll ignore the Subjectivity values, but I also mapped them to numbers: 1 is weaksubj and 2 is strongsubj.)

The MPQA's vocabulary is very large, and it includes words of different parts of speech. Our lexicon of ratings data contains only adjectives, so I'll restrict the frame to just the adjectival data:

mpqa.adj = subset(mpqa, POS=='a')

We're now in a position to compare our derived values against the MPQA. The first step is to add an MPQA column to our lexicon frame ratings.assess. I again call on ddply:

## Function for adding an MPQA value to a word's frame:
AddMpqa = function(pf) {
  word = as.character(pf$Word)
  if (word %in% mpqa.adj$Word) {
    val = as.numeric(subset(mpqa.adj, Word==word)$Polarity)
  } else {
    val = NA
  }
  pf$MPQA = val
  return(pf)
}
## Now call ddply to extend the frame with this new column:
ratings.assess = ddply(ratings.assess, .variables=colnames(ratings.assess), .fun=AddMpqa, .progress='text')
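
The same kind of lookup can also be written without ddply, using base R's match. The sketch below runs on toy stand-ins rather than the real files: the icy and influential rows follow head(mpqa) above, while the polarity given for horrid and the word 'zzz-not-in-mpqa' are invented purely for illustration. Note that match uses the first matching row if a word occurs more than once:

```r
## Toy stand-ins (not the real files) for mpqa.adj and ratings.assess.
toy.mpqa.adj = data.frame(
  Word = c('icy', 'influential', 'horrid'),
  Polarity = c(-1, 1, -1),   # the horrid value is hypothetical
  stringsAsFactors = FALSE)
toy.assess = data.frame(
  Word = c('horrid', 'influential', 'zzz-not-in-mpqa'),
  stringsAsFactors = FALSE)

## match() returns, for each word, its row position in the MPQA frame (or NA),
## so indexing Polarity by that position fills in NA for missing words.
toy.assess$MPQA = toy.mpqa.adj$Polarity[match(toy.assess$Word, toy.mpqa.adj$Word)]
toy.assess
```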

The final step is to use our values to make polarity predictions. There are lots of ways we could do this. My initial proposal is to assume that words for which we don't have minuscule p-values are words that are neutral (Category doesn't predict their usage), and then to use the sign of the Category coefficient from there:

## Add the predictions:
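## Words without a minuscule p value are treated as neutral (0);
## the rest take the sign of their Category coefficient.
ratings.assess$Prediction = ifelse(ratings.assess$P >= 0.00001, 0,
ifelse(ratings.assess$CategoryCoef > 0, 1,
-1))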
## Cross-tabulate the results for assessment:
x = xtabs(~ MPQA + Prediction, data=ratings.assess)
x
Prediction
MPQA -1 0 1
-1 108 39 26
0 8 2 5
1 56 41 81

The confusion matrix we just created has the gold-standard data going row-wise and our predictions going column-wise. The sum of the diagonals divided by the total is our accuracy:

sum(diag(x)) / sum(x)
[1] 0.5218579

Accuracy is not the best measure in this case because the gold-standard data are imbalanced, with very few neutral words:

apply(x, 1, sum)
-1 0 1
173 15 178

It is better to balance precision and recall when thinking about how we did:

## Precision divides the true positives for category i by the sum of all the guesses we made for category i (column i):
Precision = function(x, i) { x[i,i] / sum(x[, i]) }
Precision(x, 1)
[1] 0.627907
Precision(x, 2)
[1] 0.02439024
Precision(x, 3)
[1] 0.7232143
## Recall divides the true positives by the sum of all the gold-standard cases of category i (row i):
Recall = function(x, i) { x[i,i] / sum(x[i, ]) }
Recall(x, 1)
[1] 0.6242775
Recall(x, 2)
[1] 0.1333333
Recall(x, 3)
[1] 0.4550562
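
The per-category numbers can also be computed for all three categories at once. The sketch below rebuilds the confusion matrix by hand from the cross-tabulation above and uses the standard definitions: precision normalizes by column totals (our predictions) and recall by row totals (the gold labels). The F1 score is an addition here, included as one common way to balance the two:

```r
## The confusion matrix from the text: rows are MPQA (gold), columns are predictions.
x = matrix(c(108, 39, 26,
               8,  2,  5,
              56, 41, 81),
           nrow = 3, byrow = TRUE,
           dimnames = list(MPQA = c('-1', '0', '1'),
                           Prediction = c('-1', '0', '1')))

## Precision: correct predictions for a class / all predictions of that class.
precision = diag(x) / colSums(x)

## Recall: correct predictions for a class / all gold instances of that class.
recall = diag(x) / rowSums(x)

## F1: the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

round(rbind(precision, recall, f1), 4)
```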

Clearly, the neutral category is problematic for us. This is because our model predicts this category much more often than it is attested in MPQA. This seems like an inherent drawback to the assessment method: we just have different standards for neutrality at work. (Perhaps we would do better to assess only against the strongsubj cases.) However, the pos/neg confusions should be addressable. Let's look at a random sample of the errors from our worst category in terms of precision:

neg.fail = subset(ratings.assess, MPQA==-1 & MPQA != Prediction)
neg.fail[sample(nrow(neg.fail), 20), ]
Word EC CategoryCoef P Tokencount Prediction MPQA
250 miserable -0.08575085 -1.62904156 1.047323e-11 9497 0 -1
276 poor -0.15553452 -2.17494454 0.000000e+00 141630 0 -1
26 angry -0.03425602 -0.60356329 2.301931e-12 39230 0 -1
43 audacious 0.15390341 0.22513033 8.629755e-01 1834 1 -1
247 melancholy 0.17546686 1.48608678 1.201083e-04 13232 1 -1
105 depressing 0.01700738 -0.91394117 1.480040e-10 31365 0 -1
337 tragic 0.20913108 0.39607406 9.764106e-03 35579 1 -1
303 scathing 0.14579717 0.09534574 9.149460e-01 2641 1 -1
15 aggressive 0.12895422 0.42992941 2.104427e-01 13515 1 -1
136 empty -0.03145483 -1.18175665 3.050219e-08 36553 0 -1
47 bad -0.12220433 -1.80590776 0.000000e+00 710896 0 -1
129 dull -0.12414360 -2.08129609 2.222769e-104 49574 0 -1
133 edgy 0.03534387 0.08147935 7.650492e-01 10522 1 -1
308 sharp 0.16568395 0.45674525 1.601153e-03 42703 1 -1
352 unpleasant -0.06834313 -1.66741984 2.343287e-07 9491 0 -1
301 sad 0.11151508 -0.34445251 2.198336e-10 122005 0 -1
361 weak -0.04389323 -1.35969281 2.694039e-98 81922 0 -1
98 dead 0.04158026 -0.64421607 2.909988e-34 189734 0 -1
67 brutal 0.16066174 0.43316829 1.356115e-03 36470 1 -1
257 nervous 0.22557651 0.31895246 3.455089e-01 12391 1 -1

If one samples around with ratingPlot, it quickly becomes clear that many of the failures are failures of our underlying model: words like sad are too weak to have truly significant linear relationships to the categories. They call for the quadratic model, making it all the more pressing that we figure out a way to bring those values into our lexicon generation method (exercise ex:quad; a hint: both the quadratic and the linear Category coefficients can be used as scores, and they can even be effectively compared, as long as they are normalized somewhat, since the quadratic scores are generally much smaller than the linear ones.)

Exercise ex:assess-ec Add a column ECPrediction to the ratings.assess frame and fill it with EC values, then re-run the assessment. How do the results compare to those obtained with the CategoryCoef values?

Exercise ex:subj The MPQA data.frame also supplies subjectivity values (1 is weaksubj and 2 is strongsubj). Do these values correlate with our sentiment scores?

Exercise ex:op Today's code distribution also contains a CSV file with the contents of Bing Liu's Opinion Lexicon. Assess our own lexicon against this.

Bringing in context

In the section on Experience Project data, we used first logistic regression and then hierarchical logistic regression to model the relationships between words and contextual variables. I pursue a version of the hierarchical strategy here. The above models are insensitive to all kinds of context, but we've seen already that such variables matter. It's time to bring them into our models and, in turn, into our lexicons!

ratings.R makes available a few hierarchical models. LmerWordType fits a hierarchical model that uses Category as the sole fixed-effects predictor and Type as its sole hierarchical predictor. This model delivers a fixed effects estimate for Category (a kind of weighted average over the different subclasses given by Type), and it also provides such estimates for the intercept and Category values for each value of Type. Thus, we have one pooled model (fixed effects) and 1 model for each of the 31 different Type values:

horrid = ratingFullFrame(d, word='horrid')
horrid.lmer = LmerWordType(horrid)
fixef(horrid.lmer)
(Intercept) Category
-7.565669 -2.644308
coef(horrid.lmer)
$Type
(Intercept) Category
Action -7.274226 -2.561226
Adventure -7.274037 -2.667250
Animation -7.147119 -2.682686
Apparel -7.409348 -2.620771
Beauty -7.270026 -2.441967
Books -7.422696 -2.313351
Comedy -7.331637 -2.906460
Computers -7.778343 -2.671019
Crime -7.273049 -2.341563
Documentary -7.208281 -1.889977
Drama -7.254646 -2.402365
DVD -7.146788 -2.230678
Electronics -7.902919 -2.277001
Home -8.307209 -2.358733
Horror -6.938018 -2.160266
Hotels -8.179016 -2.892307
Instruments -7.165105 -2.583995
Magazines -7.531645 -2.605249
Music -7.465000 -3.308469
Office Products -7.456904 -2.627931
Photography -8.287318 -3.209077
Restaurants -9.015850 -3.897194
Software -7.988622 -2.706323
Sports -7.953478 -2.928057
Technology -7.465764 -2.908992
Tools -7.730007 -2.751685
Video -7.153390 -2.379736
Video Games -7.107696 -2.492795

The function ratingPlotHierarchicalCoefficients plots the fixed and hierarchical coefficients. The result is a ranking, top to bottom, but with the labels jittered randomly left to right so that nearby values don't sit on top of each other. The fixed-effect estimate is given as a big green dot:

par(mfrow=c(1,3))
ratingPlotHierarchicalCoefficients(d, 'horrid', LmerWordType, fixed.coef='Category', hierarchical.coef='Type')
ratingPlotHierarchicalCoefficients(d, 'hilarious', LmerWordType, fixed.coef='Category', hierarchical.coef='Type')
ratingPlotHierarchicalCoefficients(d, 'surprising', LmerWordType, fixed.coef='Category', hierarchical.coef='Type')

Figure fig:lmer-types

Hierarchical coefficient estimates for Type.

The model LmerWordModifier uses the same technique, but with Modifier as the hierarchical effect. The results provide a glimpse of the multifaceted ways in which adverbs modulate the meanings of the words they modify.

ratingPlotHierarchicalCoefficients(d, 'disappointing', LmerWordModifier, fixed.coef='Category', hierarchical.coef='Modifier')

Figure fig:lmer-mods

Hierarchical coefficient estimates for Modifier.

Exercise ex:modshapes The ModifierType values are shapes that correspond intuitively to the picture we get when regressing the frequency of Modifier on Category or Category². These provide rough semantic classifications, and the resulting models are more constrained than those obtained directly from the modifiers. ratings.R contains a model LmerWordModifierType for exploring these values with respect to specific words. Use ratingPlotHierarchicalCoefficients to get a feel for what the shapes are like and how they might be used in semantic/sentiment analysis.

Exercise ex:neg-estimates Use the above modeling techniques (any of them) to study the effects of negation. Are there generalizations we can make about how negation affects the sentiment scores we can derive from these data?

Exercise ex:larger The vocabulary of ratings-advadj.csv might start to feel confining after a little while. The file imdb-unigrams.csv in the code distribution for my SALT 20 paper is compatible with the above functions. It doesn't have the contextual features, but it has a truly massive vocabulary: potts-salt20-data-and-code.zip.

Exercise ex:assess-lmer As a first step towards building a context-aware lexicon, adapt the scale generation method of section 7 above to the hierarchical setting by using the fixef coefficients in place of the GLM ones, using one of the lmer models discussed above. Then assess the resulting lexicon against the MPQA as in section 8.