Sentiment analysis courses generally begin with techniques for classifying bits of text as either positive or negative. This is, of course, an important distinction, and useful for lots of applications, but I worry that it doesn't do justice to the blended, multidimensional nature of sentiment expression in language, and it tends to miss the social aspects of that expression.
In this first lecture, I hope to set a multidimensional, social tone, by looking at word-level data with complex, nuanced sentiment labels that are inherently relational. Our immediate goal will be to derive context-dependent lexicons from them. It's my hope that this provides insight into sentiment expression and sentiment analysis, and also that the lexicons prove useful later in the week.
If you haven't already, get set up with the course data and code.
The Experience Project is a social networking website that allows users to share stories about their own personal experiences. At the confessions portion of the site, users write stories about themselves that are typically very emotional, and readers can then choose from among five reaction categories to the story by clicking on one of the five icons in figure fig:ep_cats. The categories provide rich new dimensions of sentiment, ones that are generally orthogonal to the positive/negative dimension that most people study but that nonetheless model important aspects of sentiment expression and social interaction.
Table tab:ep provides example confessions along with the contextual data we have available. (These examples are less shocking/sad than the norm at this site — they are confessions, after all.)
Using the raw texts of the confessions and their associated metadata, I compiled a large table of statistics about individual words in their social context (to the extent that we can recover that context from the site). To load this data into R, move to the directory where you are storing your files for this course (see the Misc menu in R) and then run this command:
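A minimal version, assuming the context file is named ep3-context.csv (the filename mentioned in exercise ex:larger at the end of this section):

    ep = read.csv('ep3-context.csv')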
The file is large, so this might take a little while to load. Once it has loaded, enter
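    head(ep)  # base R: show the first six rows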
This displays the first six rows of the data.frame, giving you a look at the columns.
Here's a rundown of the column values:
Word: the word whose counts the row reports.
Category: the reaction category, one of the five shown in figure fig:ep_cats.
Age, Gender, Group: contextual metadata about the story's author, each with an 'unknown' level wherever the user declined to supply a value.
Count, Total: reaction-category click counts, explained just below.
The Count and Total values are not token counts, but rather counts of the number of reaction categories chosen by readers at the site. The actual token counts generally provide a more intuitive estimate of the amount of data that we have about specific words in context, so I have put them in a separate file:
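To load it (I'm guessing at the token-counts filename here; check your distribution):

    eptok = read.csv('ep3-tokencounts.csv')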
I've written a number of functions for working with data.frames like ep and eptok as loaded above. To make those available:
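    source('ep.R')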
The function epFullFrame allows you to pull out subframes of the larger ep frame based on specific values. Here are some examples of its use:
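    epFullFrame(ep, 'funny')
    epFullFrame(ep, c('funny', 'drunk'))
    ## Keyword arguments restrict the context; the argument names and
    ## level values here are my guesses at the interface:
    epFullFrame(ep, 'funny', gender='female')
    epFullFrame(ep, 'funny', gender=c('female', 'male'), age='unknown')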
The first argument is the data.frame we loaded before. The second argument is a word or a vector of words (given as quoted strings inside c()). The other arguments are given by keyword. They can come in any order. All of them can be specific values or vectors of values.
The above frames tend to be very large. You can collapse them down into a single five-line frame with epCollapsedFrame:
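    epCollapsedFrame(ep, 'funny')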
This function groups all of your contextual variables together under Word and sums all the corresponding Count and Total values relative to Category. We'll make heavy use of this function when visualizing the data.
To start, let's ignore the contextual metadata and just study how the words relate to the Category values. I'll use funny as my running example:
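    funny = epCollapsedFrame(ep, 'funny')
    funny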
The Count value shows the number of times that stories containing funny received Category reactions. R makes it very easy to plot these two variables with respect to each other:
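For example, with a base R barplot (the exact plotting idiom is up to you):

    barplot(funny$Count, names.arg=as.character(funny$Category), ylab='Count')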
This is disappointing! The dominant category appears to be I understand, a sympathetic category. The runner-up is Sorry, hugs, another sympathetic category. This doesn't seem right; if I have occasion to use funny in a story, chances are it is an amusing story. Why isn't Teehee the most likely category?
The problem is immediately apparent if we plot the Total values, which give the overall usage of each of the five categories:
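    barplot(funny$Total, names.arg=as.character(funny$Category), ylab='Total')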
Here, we see that the two sympathetic categories that dominated above are also over-represented here. This is explicable in terms of the dynamics of the site: people mostly write painful, heart-wrenching confessions, and the community is a sympathetic one. This is a nice little find, but we need to push past it if we are going to get at the lexical meanings involved, since it looks like even funny just returns the EP-wide ranking of categories.
To abstract away from category size, we divide the Count values by the Total values to get the relative frequency of clicks:
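    funny$Freq = funny$Count / funny$Total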
The epCollapsedFrame function will add these values automatically with the flag freqs=TRUE:
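    funny = epCollapsedFrame(ep, 'funny', freqs=TRUE)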
Plotting Category by Freq gives very intuitive results for funny, because we have overcome the influence of the underlying Category-size imbalances:
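    barplot(funny$Freq, names.arg=as.character(funny$Category), ylab='Freq')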
The Freq values can be thought of as conditional distributions of the form P(word|category), i.e., the probability of a speaker using word given that the reaction is category. These values are naturally very small (there are a lot of words to choose from). We can make them more intuitive by calculating a distribution P(category|word), i.e., the probability of category given that the speaker used word. This is a very natural perspective given the reaction-oriented nature of the metadata. The calculation is as follows:
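    ## Normalize the five Freq values so that they sum to 1:
    funny$Pr = funny$Freq / sum(funny$Freq)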
This is equivalent to an application of Bayes' Rule under the assumption of a uniform prior distribution P(category). (Without that uniformity assumption, we will reintroduce the Category bias that we are trying to avoid.)
The epCollapsedFrame function will add these values automatically with the flag probs=TRUE:
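    funny = epCollapsedFrame(ep, 'funny', probs=TRUE)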
Plotting the Pr values gives a distribution that is a simple rescaling of the previous one based on Freq, but the numbers are more intuitive:
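    barplot(funny$Pr, names.arg=as.character(funny$Category), ylab='Pr')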
The plots so far are relatively utilitarian. I've included in ep.R a function epPlot that yields a lot more information in a way that I hope is more perspicuous:
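For example (I'm assuming the first two arguments parallel the functions above; probs=TRUE is the flag discussed next):

    epPlot(ep, 'funny', probs=TRUE)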
These plots include confidence intervals derived from the binomial. When probs=TRUE, they also give an informal null-hypothesis line at 1/5 = 0.20. As a first approximation, we might conjecture that any category with its confidence interval fully above 0.20 is over-represented and any with its confidence interval fully below 0.20 is under-represented. I think an even better way to get at this is to calculate observed/expected values for each category. These can be added to collapsed frames as follows:
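I assume a flag parallel to freqs and probs:

    funny = epCollapsedFrame(ep, 'funny', oe=TRUE)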
The observed values just repeat Count. The expected values are derived by multiplying the word's overall token count by the probability of each category:
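Computed by hand, reading 'overall token count' as the sum of the word's Count values, so that the expected values sum to the observed total (as the G-test below requires):

    funny$Expected = sum(funny$Count) * (funny$Total / sum(funny$Total))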
The oe value is then the result of dividing observed by expected and subtracting 1 from the result:
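    funny$oe = (funny$Count / funny$Expected) - 1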
Where oe is above 0 for a category C, the word is over-represented for C. Where oe is below 0 for C, the word is under-represented for C.
You can plot these values directly with epPlot:
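Again assuming a flag parallel to the ones above:

    epPlot(ep, 'funny', oe=TRUE)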
These plots also display statistics from the G-test. The G-test closely resembles the chi-squared test, except that it compares the observed and expected values in log space (it is a likelihood-ratio test). The p value assesses the probability of the observed counts given that they are drawn from a multinomial distribution given by the expected counts. It provides some initial insight into whether we should draw any conclusions from the observed distribution. I emphasize that this is merely an initial insight. Our counts are so large that nearly every distribution is unlikely given the logic of the G-test, and we will be running lots of such tests, so its value is really in showing where we have insufficient evidence. That is, where p is large, we probably can't draw conclusions. Where p is small, we have a promising clue, but we'll want to do more work.
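For reference, here is the calculation as I understand it, applied to the funny frame built above:

    ## G statistic: 2 * sum(observed * log(observed/expected)); zero
    ## counts would contribute 0 and need special handling.
    g = 2 * sum(funny$Count * log(funny$Count / funny$Expected))
    ## p value from the chi-squared distribution with 5 - 1 = 4 df:
    pchisq(g, df=nrow(funny) - 1, lower.tail=FALSE)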
Exercise ep:ex:plot Pick a coherent set of adjectives and use epPlot to get a feel for what the data are like and how the derived values behave. If you want to see all of them at once, use
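    par(mfrow=c(2, 3))  # base R: a grid of plot panels; adjust the dimensions to your word set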
and then call epPlot repeatedly.
Exercise ep:ex:catassoc For each category, try to find some words that strongly associate with that category, in the way we saw that funny associates with Teehee.
I now offer a simple method for generating a multidimensional lexicon using the above techniques:
First, we define an auxiliary function that, when given a subframe based on a specific Word value w, runs the G-test on the collapsed frame for w and returns the vector of O/E values along with the p value and the overall token count:
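Here's a sketch of what epLexicon might look like, built from the pieces above (the real version's internals surely differ):

    epLexicon = function(subframe) {
      cf = epCollapsedFrame(subframe, subframe$Word[1])
      observed = cf$Count
      expected = sum(observed) * (cf$Total / sum(cf$Total))
      oe = (observed / expected) - 1
      names(oe) = as.character(cf$Category)
      ## G-test; zero observed counts contribute 0 to the statistic:
      terms = ifelse(observed == 0, 0, observed * log(observed / expected))
      g = 2 * sum(terms)
      p = pchisq(g, df=length(observed) - 1, lower.tail=FALSE)
      data.frame(t(oe), p.value=p, Tokencount=sum(observed), check.names=FALSE)
    }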
The plyr library allows us to apply epLexicon to the entire ep data.frame:
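    library(plyr)
    ## ddply splits ep by Word, applies epLexicon to each subframe, and
    ## binds the results into a single data.frame (this can be slow):
    eplex = ddply(ep, .(Word), epLexicon)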
Now we have some options. We can filter on p.value and/or Tokencount values, reduce negative values to 0, and so forth. Here's a function for seeing the top associates for a given category:
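A hypothetical version (the function and argument names are mine):

    ## Filter on p.value and Tokencount, then sort by the O/E column for
    ## the chosen category:
    epTopAssociates = function(lex, category, n=25, maxp=0.05, mintokens=100) {
      sub = lex[lex$p.value <= maxp & lex$Tokencount >= mintokens, ]
      head(sub[order(sub[[category]], decreasing=TRUE), c('Word', category)], n)
    }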
And a pretty restrictive call to it:
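    ## The category label must match a column name in your lexicon frame:
    epTopAssociates(eplex, 'teehee', n=10, maxp=0.0001, mintokens=1000)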
You might want to save your lexicon in a CSV file so that you can read it in later:
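    write.csv(eplex, 'eplexicon.csv', row.names=FALSE)  # filename is up to you
    ## later: eplex = read.csv('eplexicon.csv', check.names=FALSE)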
Exercise ep:ex:lex If you were able to generate the lexicon, use the above functions to study it.
Exercise ep:ex:lexprob What shortcomings is a lexicon like this likely to have, and how might we address those shortcomings?
So far, we have ignored all the cool contextual features included in the ep data.frame. It's now time to bring them in, to create a more social, context-aware lexicon.
As a first pass towards understanding the connections, we do a bunch of epPlot calls for different values of the metadata we care about:
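Something like the following, where the keyword names and level values are my guesses at the interface:

    epPlot(ep, 'funny', gender='female', probs=TRUE)
    epPlot(ep, 'funny', gender='male', probs=TRUE)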
Or, for Age, excluding the unknown category:
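    ## Assuming 'unknown' is the level name for missing ages:
    epPlot(ep, 'funny', age=setdiff(unique(ep$Age), 'unknown'), probs=TRUE)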
I don't find these especially easy to parse. It's easy to see that the distributions differ, but fine-grained analysis of how they differ is hard. I think ep.R provides a better method. The function epCategoryByFactorPlot lets you visualize, for a given word, how the Category variables relate to a supplied piece of context data. For example:
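I'm guessing at the argument structure here:

    epCategoryByFactorPlot(ep, 'funny', 'Gender')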
Exercise ep:ex:contexthighlight For each of the context variables we have (Age, Gender, Group), try to find words that highlight how these variables are important.
Exercise ep:ex:agerel Survey designers sometimes think of age as being quadratic in the sense that the very old and the very young pattern together for a variety of issues. Are there any words that manifest this parabolic pattern?
Exercise ep:ex:unk Each contextual variable has a sizable 'unknown' population, because users at the site are not required to supply these data points. Can we venture any tentative inferences about what this unknown population is like by studying the usage patterns?
The following plot suggests that stories containing drunk are importantly influenced by the age of the author:
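    epCategoryByFactorPlot(ep, 'drunk', 'Age')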
I propose two regression-based methods for understanding these relationships more deeply. To start, let's fit a simple generalized linear model (GLM) that uses both Category and Age to predict the log-frequency of drunk:
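A sketch, assuming a binomial GLM over the Count/Total click data (the logit is close to the log-frequency here, since the frequencies are tiny) and a crude numeric coding of the Age bins:

    drunk = epFullFrame(ep, 'drunk')
    drunk = subset(drunk, Age != 'unknown')       # drop the unknown-age rows
    drunk$AgeNum = as.numeric(factor(drunk$Age))  # crude numeric coding of the age bins
    fit.glm = glm(cbind(Count, Total - Count) ~ Category + AgeNum - 1,
                  family=binomial, data=drunk)
    summary(fit.glm)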
In this model, I subtracted 1 from the predictors in the formula, which corresponds to removing the Intercept term. Without this, R would pick one of the Category values (probably hugs, since it is alphabetically first) as a reference category, which would make the model hard to interpret.
Now, the above model is hard to interpret anyway, but we can make it more intuitive by thinking of it as a function that predicts frequency based on which category we're in and what age our author is:
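Here's one way to write it (FittedGlmFunc is the name exercise ep:adapt uses below; the coefficient labels assume the fit.glm model above):

    FittedGlmFunc = function(fit, category, age) {
      eta = coef(fit)[paste('Category', category, sep='')] +
            coef(fit)['AgeNum'] * age
      plogis(eta)  # invert the logit link to get a predicted frequency
    }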
A few examples:
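    ## The level names 'hugs' and 'teehee' are guesses; inspect
    ## unique(drunk$Category) for the actual labels.
    FittedGlmFunc(fit.glm, 'hugs', 2)     # second age bin
    FittedGlmFunc(fit.glm, 'teehee', 5)   # fifth age bin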
The following plot systematically compares the fitted values with the empirical ones, using epPlot:
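I won't guess at epPlot's interface for overlaying fitted values; as a stand-in, here is a plain-R comparison at, say, the third age bin:

    emp = epCollapsedFrame(ep, 'drunk', probs=TRUE)
    fits = sapply(as.character(emp$Category), function(x) FittedGlmFunc(fit.glm, x, 3))
    round(cbind(empirical=emp$Pr, fitted=fits / sum(fits)), 3)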
The second method I propose uses a hierarchical logistic regression in which the only fixed-effect predictor is Category and Age is a hierarchical predictor. This is very similar to what we did above, but it should provide better by-Age estimates and it will give us more flexibility when building lexicons.
In brief, a hierarchical regression model is one in which there are two kinds of predictor: fixed-effects terms, which are analogous to the predictors from our GLM above, and hierarchical terms, which group the data into subsets. The coefficient values are fit via an iterative process that goes back and forth between the fixed effects and the hierarchical ones, continually reestimating the values until they stabilize. In practical terms, this tends to give better estimates for subgroups for which we have relatively little data, because those subgroup estimates benefit from those of the full population.
The only downsides I see are that the models are computationally expensive to fit and we can't treat Age as continuous anymore.
The basic model specification looks very much like the one we used before with glm, except that now we repeat the fixed-effects terms inside parentheses, before | Age:
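    ## Current lme4 uses glmer for binomial models (older versions used
    ## lmer for these too):
    library(lme4)
    fit.lmer = glmer(cbind(Count, Total - Count) ~ Category - 1 + (Category - 1 | Age),
                     family=binomial, data=drunk)
    summary(fit.lmer)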
The coefficients are all significant, so let's concentrate on what they say. As before, they are somewhat more intuitive when converted to frequencies:
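    round(plogis(fixef(fit.lmer)), 4)  # fixed-effects estimates as frequencies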
The fixed effects estimates are a kind of weighted average of the hierarchical ones, with larger hierarchical categories contributing more than smaller ones. They can be used where no age information is available.
It's helpful to see these values plotted next to the empirical values:
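A stand-in comparison (I assume the rows line up alphabetically by category):

    emp = epCollapsedFrame(ep, 'drunk', probs=TRUE)
    fits = plogis(fixef(fit.lmer))
    round(cbind(empirical=emp$Pr, fitted=fits / sum(fits)), 3)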
Based on this plot, we might say that drunk stories are either especially heart-wrenching or especially funny. Both our empirical estimate and the fitted model agree on this. However, we can see from our earlier breakdown that this isn't an "either/or" choice. Rather, it is conditioned in part by age. The hierarchical estimates bring this out well. To extract them:
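    ## Per-Age estimates: fixed effects plus the by-Age adjustments.
    coef(fit.lmer)$Age
    ## And as frequencies:
    plogis(as.matrix(coef(fit.lmer)$Age))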
We can again plot these values and compare them to their empirical estimates:
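    ## A stand-in plot: one line per category across the Age levels.
    fits = plogis(as.matrix(coef(fit.lmer)$Age))
    matplot(fits, type='l', xlab='Age level', ylab='Fitted frequency')
    legend('topright', legend=colnames(fits), col=1:5, lty=1:5)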
Exercise ep:adapt Adapt the above code so that you can run additional case studies. Really, this just involves turning the model fitting steps into a function, and perhaps writing a few versions of FittedGlmFunc for different contextual variables.
The above case study suggests a general method for building context-sensitive lexicons. Here is the procedure, focussing on sensitivity to author age:
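In code, a compact sketch that generalizes the drunk case study (the function name is mine; error handling and convergence checks are omitted):

    epAgeLexicon = function(ep, words) {
      results = list()
      for (w in words) {
        sub = subset(ep, Word == w & Age != 'unknown')
        fit = glmer(cbind(Count, Total - Count) ~ Category - 1 + (Category - 1 | Age),
                    family=binomial, data=sub)
        ## Store the by-Age fitted frequencies for this word:
        results[[w]] = plogis(as.matrix(coef(fit)$Age))
      }
      results
    }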
This is very computationally intensive, as you probably gathered if you waited around for your own lmer models to converge. However, it can be parallelized if necessary, and it need only be done once.
Exercise ex:larger The small vocabulary of ep3-context.csv might start to feel confining after a little while. The file epconfessions-unigrams.csv in the code distribution for my SALT 20 paper is compatible with the above functions. It doesn't have the contextual features, but it has a truly massive vocabulary: potts-salt20-data-and-code.zip. The distribution doesn't contain a token-counts file, but you can download one here and treat it like eptok above: epconfessions-tokencounts-unigrams.csv.zip.