Multidimensional, social sentiment lexicons

  1. Overview
  2. Experience Project confessions
  3. Data and code
  4. Extracting subframes
  5. Basic word-level data
  6. Basic lexicon generation
  7. Bringing in the contextual features
    1. Case study: Age and drunk
    2. Context-sensitive lexicons

Overview

Sentiment analysis courses generally begin with techniques for classifying bits of text as either positive or negative. This is, of course, an important distinction, and useful for lots of applications, but I worry that it doesn't do justice to the blended, multidimensional nature of sentiment expression in language. It also tends to miss the social aspects of that expression.

In this first lecture, I hope to set a multidimensional, social tone by looking at word-level data with complex, nuanced sentiment labels that are inherently relational. Our immediate goal will be to derive context-dependent lexicons from them. It's my hope that this provides insight into sentiment expression and sentiment analysis, and also that the lexicons prove useful later in the week.

If you haven't already, get set up with the course data and code.

Experience Project confessions

The Experience Project is a social networking website that allows users to share stories about their own personal experiences. At the confessions portion of the site, users write typically very emotional stories about themselves, and readers can then react to a story by clicking one of the five category icons in figure fig:ep_cats. The categories provide rich new dimensions of sentiment, ones that are generally orthogonal to the positive/negative distinction that most people study but that nonetheless model important aspects of sentiment expression and social interaction.

Figure fig:ep_cats
Experience Project categories. "You rock" is a positive exclamative category. "Teehee" is a playful, lighthearted category. "I understand" is an expression of solidarity. "Sorry, hugs" is a sympathetic category. And "Wow, just wow" is a negative exclamative category, the least used on the site.
figures/ep/ep-reactions-snapshot.png

Table tab:ep provides example confessions along with the contextual data we have available. (These examples are less shocking/sad than the norm at this site — they are confessions, after all.)

Table tab:ep
Emotionally blended confessions from the Experience Project.
  1. I really hate being shy ... I just want to be able to talk to someone about anything and everything and be myself ... That's all I've ever wanted.
    • Reactions: hugs: 1; rock: 1; teehee: 2; understand: 10; wow: 0
    • Author age: 21
    • Author gender: female
    • Text group: friends
  2. I bought a case of beer, now I'm watching a South Park marathon while getting drunk :P
    • Reactions: hugs: 2; rock: 3; teehee: 2; understand: 3; wow: 0
    • Author age: 25
    • Author gender: male
    • Text group: health

Data and code

Using the raw texts of the confessions and their associated metadata, I compiled a large table of statistics about individual words in their social context (to the extent that we can recover that context from the site). To load this data into R, move to the directory where you are storing your files for this course (see the Misc menu in R) and then run this command:

  1. ep = read.csv('ep3-context.csv')

The file is large, so this might take a little while to load. Once it has loaded, enter

  1. head(ep)

This displays the top six lines of the file, which should look like this:

  1. Word Category Group Gender Age Count Total
  2. 1 abandoned hugs sex male 2 1 25999
  3. 2 abandoned rock funny female 2 0 3141
  4. 3 abandoned understand offtopic female 1 3 42736
  5. 4 abandoned hugs family female 2 6 28671
  6. 5 abandoned teehee other unknown unknown 5 71067
  7. 6 abandoned understand other female 1 2 10915

Here's a rundown of the column values:

  • Word: the word itself.
  • Category: one of the five reaction categories (hugs, rock, teehee, understand, wow).
  • Group: the text group (topic area) of the confession (e.g., family, friends, health, venting).
  • Gender: the author's gender (female, male, or unknown).
  • Age: the author's age bracket, coded 1-5 (1 = teens, 2 = 20s, 3 = 30s, and so on up the age range) or unknown.
  • Count and Total: reaction-click counts, described just below.

The Count and Total values are not token counts, but rather counts of reaction clicks by readers at the site: Count gives the number of clicks in the row's Category for stories containing the row's Word in the row's context, and Total gives the overall number of clicks in that Category for that context. The actual token counts generally provide a more intuitive estimate of the amount of data that we have about specific words in context, so I have put them in a separate file:

  1. eptok = read.csv('ep3-context-tokencounts.csv')
  2. head(eptok)
  3. Word Group Gender Age TokenCount
  4. 1 abandoned family female 5 2
  5. 2 abandoned embarrassing female 4 1
  6. 3 abandoned venting female 5 1
  7. 4 abandoned offtopic unknown unknown 4
  8. 5 abandoned love unknown 4 1
  9. 6 abandoned venting unknown 2 1

I've written a number of functions for working with data.frames like ep and eptok as loaded above. To make those available:

  1. source('ep.R')

Extracting subframes

The function epFullFrame allows you to pull out subframes of the larger ep frame based on specific values. Here are some examples of its use:

  1. pf = epFullFrame(ep, 'cool')
  2. pf = epFullFrame(ep, 'drunk', ages=1)
  3. pf = epFullFrame(ep, 'crazy', genders='male')
  4. pf = epFullFrame(ep, 'crazy', ages=1, genders='male')
  5. pf = epFullFrame(ep, c('gnarly','wicked'), ages=c(1,2,3))
  6. head(pf)
  7. Word Category Group Gender Age Count Total
  8. 370766 gnarly rock health female 2 0 4604
  9. 370767 gnarly hugs offtopic male 2 0 7422
  10. 370768 gnarly hugs health female 2 0 15228
  11. 370769 gnarly understand venting female 3 0 26740
  12. 370770 gnarly wow offtopic male 1 0 1315
  13. 370771 gnarly rock offtopic male 1 0 1339

The first argument is the data.frame we loaded before. The second argument is a word or a vector of words (given as quoted strings inside c()). The other arguments are given by keyword. They can come in any order. All of them can be specific values or vectors of values.
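
To make the keyword semantics concrete, here is a minimal sketch of the underlying subsetting logic. The sketch is mine; the real epFullFrame in ep.R also handles the other keywords (e.g., groups) and builds the appropriate defaults:

  1. ## A hypothetical simplification of epFullFrame, for illustration only:
  2. fullFrameSketch = function(df, words, ages=NULL, genders=NULL) {
  3.   sub = subset(df, Word %in% words)
  4.   if (!is.null(ages)) { sub = subset(sub, Age %in% ages) }
  5.   if (!is.null(genders)) { sub = subset(sub, Gender %in% genders) }
  6.   return(sub)
  7. }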

The above frames tend to be very large. You can collapse them down into a single five-line frame with epCollapsedFrame:

  1. pf = epCollapsedFrame(ep, 'cool')
  2. pf = epCollapsedFrame(ep, 'drunk', ages=1)
  3. pf = epCollapsedFrame(ep, 'crazy', genders='male')
  4. pf = epCollapsedFrame(ep, 'crazy', ages=1, genders='male')
  5. pf = epCollapsedFrame(ep, c('gnarly','wicked'), ages=c(1,2,3))
  6. pf
  7. Word Category Count Total
  8. 1 gnarly|wicked/teens|20s|30s hugs 14 516196
  9. 2 gnarly|wicked/teens|20s|30s rock 30 461418
  10. 3 gnarly|wicked/teens|20s|30s teehee 19 251035
  11. 4 gnarly|wicked/teens|20s|30s understand 17 562470
  12. 5 gnarly|wicked/teens|20s|30s wow 15 319949

This function groups all of your contextual variables together under Word and sums all the corresponding Count and Total values relative to Category. We'll make heavy use of this function when visualizing the data.
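
If you're curious about what the collapsing amounts to, here is a minimal sketch of the core aggregation step, on the assumption that summing within Category is the heart of it (epCollapsedFrame itself also constructs the combined Word label seen above):

  1. ## A hypothetical simplification of epCollapsedFrame's aggregation step:
  2. collapseSketch = function(pf) {
  3.   return(aggregate(cbind(Count, Total) ~ Category, data=pf, FUN=sum))
  4. }
  5. collapseSketch(epFullFrame(ep, c('gnarly','wicked'), ages=c(1,2,3)))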

Basic word-level data

To start, let's ignore the contextual metadata and just study how the words relate to the Category values. I'll use funny as my running example:

  1. funny = epCollapsedFrame(ep, 'funny')
  2. funny
  3. Word Category Count Total
  4. 1 funny hugs 1730 2357657
  5. 2 funny rock 1664 1933477
  6. 3 funny teehee 1633 1101917
  7. 4 funny understand 2284 2708855
  8. 5 funny wow 1324 1711879

The Count value shows the number of times that stories containing funny received Category reactions. R makes it very easy to plot these two variables with respect to each other:

  1. plot(funny$Category, funny$Count, xlab='Category', ylab='Count', main='funny')
Figure ep:count
Raw Count values for funny.
figures/ep/funny-count.png

This is disappointing! The dominant category appears to be I understand, a sympathetic category. The runner-up is Sorry, hugs, another sympathetic category. This doesn't seem right; if I have occasion to use funny in a story, chances are it is an amusing story. Why isn't Teehee the most likely category?

The problem is immediately apparent if we plot the Total values, which give the overall usage of each of the five categories:

  1. plot(funny$Category, funny$Total, xlab='Category', ylab='Total', main='All of EP')
Figure ep:total
EP-wide Total values.
figures/ep/ep-total.png

Here, we see that the two sympathetic categories that dominated above are also over-represented here. This is explicable in terms of the dynamics of the site: people mostly write painful, heart-wrenching confessions, and the community is a sympathetic one. This is a nice little find, but we need to push past it if we are going to get at the lexical meanings involved, since it looks like even funny just returns the EP-wide ranking of categories.

To abstract away from category size, we divide the Count values by the Total values to get the relative frequency of clicks:

  1. funny$Count / funny$Total
  2. [1] 0.0007337793 0.0008606257 0.0014819628 0.0008431607
  3. [5] 0.0007734191

The epCollapsedFrame function will add these values automatically with the flag freqs=TRUE:

  1. funny = epCollapsedFrame(ep, 'funny', freqs=TRUE)
  2. funny
  3. Word Category Count Total Freq
  4. 1 funny hugs 1730 2357657 0.0007337793
  5. 2 funny rock 1664 1933477 0.0008606257
  6. 3 funny teehee 1633 1101917 0.0014819628
  7. 4 funny understand 2284 2708855 0.0008431607
  8. 5 funny wow 1324 1711879 0.0007734191

Plotting Category by Freq gives very intuitive results for funny, because we have overcome the influence of the underlying Category-size imbalances:

  1. plot(funny$Category, funny$Freq, xlab='Category', ylab='Count/Total', main='funny')
Figure ep:freq
Count/Total values for funny.
figures/ep/funny-freq.png

The Freq values can be thought of as conditional distributions of the form P(word|category), i.e., the probability of a speaker using word given that the reaction is category. These values are naturally very small (there are a lot of words to choose from). We can make them more intuitive by calculating a distribution P(category|word), i.e., the probability of category given that the speaker used word. This is a very natural perspective given the reaction-oriented nature of the metadata. The calculation is as follows:

  1. funny$Freq / sum(funny$Freq)
  2. 0.1563579 0.1833870 0.3157851 0.1796655 0.1648046

This is equivalent to an application of Bayes' Rule under the assumption of a uniform prior distribution P(category). (Without that uniformity assumption, we will reintroduce the Category bias that we are trying to avoid.)
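
To see the equivalence, here is the same calculation written out with an explicit uniform prior, which simply cancels in the normalization:

  1. ## Explicit Bayes' Rule with a uniform prior P(category) = 1/5:
  2. prior = rep(1/5, 5)
  3. (funny$Freq * prior) / sum(funny$Freq * prior)
  4. ## The prior cancels, so this matches funny$Freq / sum(funny$Freq).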

The epCollapsedFrame function will add these values automatically with the flag probs=TRUE:

  1. funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE)
  2. funny
  3. Word Category Count Total Freq Pr
  4. 1 funny hugs 1730 2357657 0.0007337793 0.1563579
  5. 2 funny rock 1664 1933477 0.0008606257 0.1833870
  6. 3 funny teehee 1633 1101917 0.0014819628 0.3157851
  7. 4 funny understand 2284 2708855 0.0008431607 0.1796655
  8. 5 funny wow 1324 1711879 0.0007734191 0.1648046

Plotting the Pr values gives a distribution that is a simple rescaling of the previous one based on Freq, but the numbers are more intuitive:

  1. par(mfrow=c(1,2))
  2. plot(funny$Category, funny$Freq, xlab='Category', ylab='Count/Total', main='funny')
  3. plot(funny$Category, funny$Pr, xlab='Category', ylab='(Count/Total) / sum(Count/Total)', main='funny')
Figure ep:freq-pr
Frequency and probability values for funny.
figures/ep/funny-freq-pr.png

The plots so far are relatively utilitarian. I've included in ep.R a function epPlot that yields a lot more information in a way that I hope is more perspicuous:

  1. par(mfrow=c(1,2))
  2. epPlot(ep, eptok, 'funny')
  3. epPlot(ep, eptok, 'funny', probs=TRUE)
Figure ep:epplot:freq-pr
epPlot of probability values for funny.
figures/ep/epplot-funny-freq-pr.png

These plots include confidence intervals derived from the binomial. When probs=TRUE, they also give an informal null-hypothesis line at 1/5 = 0.20. As a first approximation, we might conjecture that any category with its confidence interval fully above 0.20 is over-represented and any with its confidence interval fully below 0.20 is under-represented. I think an even better way to get at this is to calculate observed/expected values for each category. These can be added to collapsed frames as follows:

  1. funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE, oe=TRUE)
  2. funny
  3. Word Category Count Total Freq Pr observed expected oe
  4. 1 funny hugs 1730 2357657 0.0007337793 0.1563579 1730 2074.1637 -0.16592889
  5. 2 funny rock 1664 1933477 0.0008606257 0.1833870 1664 1701.2042 -0.02186935
  6. 3 funny teehee 1633 1101917 0.0014819628 0.3157851 1633 970.1432 0.68325661
  7. 4 funny understand 2284 2708855 0.0008431607 0.1796655 2284 2383.3928 -0.04170224
  8. 5 funny wow 1324 1711879 0.0007734191 0.1648046 1324 1506.0960 -0.12090600

The observed values just repeat Count. The expected values are derived by multiplying the word's total Count (summed across the categories) by the probability of each category:

  1. category.probs = (funny$Total/sum(funny$Total))
  2. funny.count = sum(funny$Count)
  3. funny.expected = funny.count * category.probs
  4. funny.expected
  5. [1] 2074.466 1701.237 969.560 2383.480 1506.256

The oe value is then the result of dividing observed by expected and subtracting 1 from the result:

  1. (funny$observed / funny.expected) - 1
  2. [1] -0.16605064 -0.02188818 0.68426917 -0.04173740 -0.12099951

Where the oe value is above 0 for a category C, the word is over-represented for C. Where it is below 0 for C, the word is under-represented for C.
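
For example, to pull out just the over-represented categories from the collapsed frame:

  1. subset(funny, oe > 0)$Category
  2. [1] teehee
  3. Levels: hugs rock teehee understand wow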

You can plot these values directly with epPlot:

  1. epPlot(ep, eptok, 'funny', oe=TRUE)

These plots also display statistics from the G-test. The G-test closely resembles the chi-squared test, except that it is done in log space. The test essentially compares the observed and expected values. The p value assesses the probability of the observed counts given that they are drawn from a multinomial distribution given by the expected counts. It provides some initial insight into whether we should draw any conclusions from the observed distribution. I emphasize that this is merely an initial insight. Our counts are so large that nearly every distribution is unlikely given the logic of the G-test, and we will be running lots of such tests, so its value is really in showing where we have insufficient evidence. That is, where p is large, we probably can't draw conclusions. Where p is small, we have a promising clue, but we'll want to do more work.
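
The function epGTest in ep.R runs this test (we'll call it directly in the next section). If you want to see the core calculation, here is a sketch of it under the multinomial logic just described; G is twice the sum of observed times log(observed/expected), referred to a chi-squared distribution:

  1. ## A sketch only; epGTest in ep.R is the interface we'll actually use:
  2. gTestSketch = function(observed, expected) {
  3.   g = 2 * sum(observed * log(observed / expected))
  4.   p = pchisq(g, df=length(observed)-1, lower.tail=FALSE)
  5.   return(list(statistic=g, p.value=p))
  6. }
  7. gTestSketch(funny$observed, funny.expected)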

Exercise ep:ex:plot Pick a coherent set of adjectives and use epPlot to get a feel for what the data are like and how the derived values behave. If you want to see all of them at once, use

  1. par(mfrow=c(2,2)) ## use this where you want a 2 x 2 plotting window

and then call epPlot repeatedly.

Exercise ep:ex:catassoc For each category, try to find some words that strongly associate with that category, in the way we saw that funny associates with Teehee.

Basic lexicon generation

I now offer a simple method for generating a multidimensional lexicon using the above techniques:

First, we define an auxiliary function that, when given a subframe based on a specific Word value w, runs the G-test on the collapsed frame for w and returns the vector of O/E values along with the p value and the word's total Count (labeled Tokencount below):

  1. ## You can paste this directly into your R buffer to make it available:
  2. epLexicon = function(pf) {
  3.   pf = epCollapsedFrame(pf, pf$Word[1], freqs=TRUE, oe=TRUE)
  4.   fit = epGTest(pf)
  5.   return(c(pf$oe, fit$p.value, sum(pf$Count)))
  6. }

The plyr library allows us to apply epLexicon to the entire ep data.frame:

  1. ## Load the needed library:
  2. library(plyr)
  3. ## This step takes a while:
  4. eplex = ddply(ep, .variables=c('Word'), .fun=epLexicon, .progress='text')
  5. ## Add intuitive column names:
  6. colnames(eplex) = c('Word', levels(ep$Category), 'p.value', 'Tokencount')
  7. ## Check out the results:
  8. head(eplex)

Now we have some options. We can filter on p.value and/or Tokencount values, reduce negative values to 0, and so forth. Here's a function for seeing the top associates for a given category:

  1. epViewAssociates = function(lexdf, category, p=1, count=1) {
  2.   sublex = subset(lexdf, p.value <= p & Tokencount >= count)
  3.   return(sublex[order(sublex[, category], decreasing=TRUE), ])
  4. }

And a pretty restrictive call to it:

  1. head(epViewAssociates(eplex, 'hugs', p=0.001, count=50))
  2. Word hugs rock teehee understand wow p.value Tokencount
  3. 591 bittersweet 1.450935 -0.7122943 -0.07182327 -0.3886007 -0.5207253 0.000000e+00 987
  4. 82 adoptive 1.434472 -0.5396253 -0.48102157 -0.2106924 -0.6161864 4.600848e-06 55
  5. 668 bothersome 1.416726 -0.9134441 -0.83308446 0.4620579 -0.6227922 1.873651e-09 53
  6. 1392 danny 1.301143 -0.2690592 -0.53930258 -0.5039832 -0.1307639 4.625697e-08 98
  7. 6559 unbiased 1.296421 -0.4620813 -0.05717857 -0.3076902 -0.7185333 6.594959e-06 58
  8. 5587 sickly 1.197354 -0.4893234 -0.43696065 -0.1369859 -0.4553989 5.935578e-06 79

You might want to save your lexicon in a CSV file so that you can read it in later:

  1. write.csv(eplex, 'ep3-basic-lexicon.csv', row.names=F)
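
Another of the options mentioned above is to reduce negative values to 0, so that the lexicon records only positive (over-represented) associations. A quick sketch of that step:

  1. ## The five score columns, named via levels(ep$Category) above:
  2. score.cols = levels(ep$Category)
  3. eplex.pos = eplex
  4. eplex.pos[score.cols][eplex.pos[score.cols] < 0] = 0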

Exercise ep:ex:lex If you were able to generate the lexicon, use the above functions to study it.

Exercise ep:ex:lexprob What shortcomings is a lexicon like this likely to have, and how might we address those shortcomings?

Bringing in the contextual features

So far, we have ignored all the cool contextual features included in the ep data.frame. It's now time to bring them in, to create a more social, context-aware lexicon.

As a first pass towards understanding the connections, we do a bunch of epPlot calls for different values of the metadata we care about:

  1. par(mfrow=c(1,3))
  2. epPlot(ep, eptok, 'awesome', genders='male', probs=T)
  3. epPlot(ep, eptok, 'awesome', genders='female', probs=T)
  4. epPlot(ep, eptok, 'awesome', genders='unknown', probs=T)
Figure ep:awesome-gender
awesome by Gender.
figures/ep/epplot-awesome-gender.png

Or, for Age, excluding the unknown category:

  1. par(mfrow=c(2,3))
  2. for (i in 1:5) { epPlot(ep, eptok, 'awesome', ages=i, probs=T) }
Figure ep:awesome-age
awesome by Age.
figures/ep/epplot-awesome-age.png

I don't find these especially easy to parse. It's easy to see that the distributions differ, but fine-grained analysis of how they differ is hard. I think ep.R provides a better method. The function epCategoryByFactorPlot lets you visualize, for a given word, how the Category values relate to a supplied contextual variable. For example:

  1. ## By gender:
  2. epCategoryByFactorPlot(ep, eptok, 'awesome', 'Gender', probs=T, type='b')
Figure ep:catby-gender
epCategoryByFactorPlot for awesome comparing Gender with the Category.
figures/ep/catby-awesome-gender.png
  1. ## By age; type='b' connects the points, making trends easier to see:
  2. epCategoryByFactorPlot(ep, eptok, 'awesome', 'Age', probs=T, type='b')
Figure ep:catby-age
epCategoryByFactorPlot for awesome comparing Age with the Category.
figures/ep/catby-awesome-age.png

Exercise ep:ex:contexthighlight For each of the context variables we have (Age, Gender, Group), try to find words that highlight how these variables are important.

Exercise ep:ex:agerel Survey designers sometimes think of age as being quadratic in the sense that the very old and the very young pattern together for a variety of issues. Are there any words that manifest this parabolic pattern?

Exercise ep:ex:unk Each contextual variable has a sizable 'unknown' population, because users at the site are not required to supply these data points. Can we venture any tentative inferences about what this unknown population is like by studying the usage patterns?

Case study: Age and drunk

The following plot suggests that reactions to stories containing drunk are importantly conditioned by the age of the author:

  1. epCategoryByFactorPlot(ep, eptok, 'drunk', 'Age', probs=T, type='b')
Figure ep:catby-drunk
epCategoryByFactorPlot for drunk comparing Age with the Category.
figures/ep/catby-drunk-age.png

I propose two regression-based methods for understanding these relationships more deeply. To start, let's fit a simple generalized linear model (GLM) that uses both Category and Age to predict the frequency of drunk, modeled on the log-odds scale:

  1. ## Get the frame:
  2. drunk = epFullFrame(ep, 'drunk', ages=c(1,2,3,4,5))
  3. ## Ensure that Age is numeric:
  4. drunk$Age = as.numeric(drunk$Age)
  5. ## Fit the model:
  6. fit.glm = glm(cbind(Count, Total-Count) ~ Category - 1 + Age, data=drunk, family=binomial)
  7. summary(fit.glm)
  8. ## I abbreviate the output of summary to avoid distractions:
  9. ...
  10. Coefficients:
  11. Estimate Std. Error z value Pr(>|z|)
  12. Categoryhugs -6.72485 0.04789 -140.414 < 2e-16 ***
  13. Categoryrock -6.82601 0.05615 -121.570 < 2e-16 ***
  14. Categoryteehee -6.38640 0.05720 -111.659 < 2e-16 ***
  15. Categoryunderstand -7.18148 0.05260 -136.534 < 2e-16 ***
  16. Categorywow -6.61393 0.05758 -114.859 < 2e-16 ***
  17. Age -0.08736 0.01477 -5.913 3.35e-09 ***
  18. ---
  19. Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  20. ...

In this model, the - 1 in the formula removes the Intercept term. Without it, R would pick one of the Category values (probably hugs, since it is alphabetically first) as a reference category, which would make the model hard to interpret.
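
If you want to see the difference between the two codings for yourself, model.matrix displays the predictors that R actually constructs (purely a diagnostic peek; it isn't needed for the analysis):

  1. head(model.matrix(~ Category - 1, data=drunk))  ## one indicator column per category
  2. head(model.matrix(~ Category, data=drunk))      ## Intercept plus four indicator columns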

Now, the above model is hard to interpret anyway, but we can make it more intuitive by thinking of it as a function that predicts frequency based on which category we're in and what age our author is:

  1. ## Calculate fitted values.
  2. ## fit: fitted model of the form P(word) ~ Category - 1 + Age
  3. ## category: one of the reaction categories
  4. ## age: an integer
  5. FittedGlmFunc = function(fit, category, age) {
  6.   ## Extract the coefficients:
  7.   coefs = fit$coef
  8.   ## Get the category coefficient:
  9.   cat.coef = coefs[[paste('Category', category, sep='')]]
  10.   ## Get the fitted value; plogis is the inverse logit function,
  11.   ## which maps the weights from the regression to frequencies:
  12.   prediction = plogis(cat.coef + coefs[['Age']]*age)
  13.   return(prediction)
  14. }

A few examples:

  1. FittedGlmFunc(fit.glm, 'wow', 1)
  2. [1] 0.001227812
  3. FittedGlmFunc(fit.glm, 'wow', 2)
  4. [1] 0.001125213
  5. FittedGlmFunc(fit.glm, 'rock', 2)
  6. [1] 0.0009103791
  7. FittedGlmFunc(fit.glm, 'rock', 5)
  8. [1] 0.0007006292

The following plot systematically compares the fitted values with the empirical ones, using epPlot:

  1. par(mfrow=c(2,3))
  2. cats = levels(ep$Category)
  3. for (i in 1:5) {
  4.   epPlot(ep, eptok, 'drunk', ages=i)
  5.   for (j in 1:5) {
  6.     val = FittedGlmFunc(fit.glm, cats[j], i)
  7.     points(j, val, col='red', pch=19)
  8.   }
  9. }
Figure ep:glm-drunk
Comparing the empirical values (black) with fitted values (red) for drunk by Age.
figures/ep/glm-drunk.png

The second method I propose uses a hierarchical logistic regression in which the only fixed-effect predictor is Category and Age is a hierarchical predictor. This is very similar to what we did above, but it should provide better by-Age estimates and it will give us more flexibility when building lexicons.

In brief, a hierarchical regression model is one in which there are two kinds of predictor: fixed-effects terms, which are analogous to the predictors from our GLM above, and hierarchical terms, which group the data into subsets. The coefficient values are fit via an iterative process that goes back and forth between the fixed effects and the hierarchical ones, continually reestimating the values until they stabilize. In practical terms, this tends to give better estimates for subgroups for which we have relatively little data, because those subgroup estimates benefit from those of the full population.

The only downsides I see are that the models are computationally expensive to fit and we can't treat Age as continuous anymore.

The basic model specification looks very much like the one we used before with glm, except that now we repeat the fixed-effects terms inside parentheses, followed by | Age:

  1. ## Load the library for doing hierarchical modeling:
  2. library(lme4)
  3. ## glmer fits the binomial (logistic) version of the mixed model; lmer itself is for Gaussian responses:
  4. fit.lmer = glmer(cbind(Count,Total-Count) ~ Category - 1 + (Category - 1 | Age), data=drunk, family=binomial)

The coefficients are all significant, so let's concentrate on what they say. As before, they are somewhat more intuitive when converted to frequencies:

  1. plogis(fixef(fit.lmer))
  2. Categoryhugs Categoryrock Categoryteehee Categoryunderstand Categorywow
  3. 0.0008895718 0.0006963748 0.0009679585 0.0005483507 0.0009364059

The fixed effects estimates are a kind of weighted average of the hierarchical ones, with larger hierarchical categories contributing more than smaller ones. They can be used where no age information is available.

It's helpful to see these values plotted next to the empirical values:

  1. epPlot(ep, eptok, 'drunk')
  2. points(plogis(fixef(fit.lmer)), col='green', pch=19, lwd=5)
Figure ep:lmer-fixed
Fixed-effects estimates for drunk (green) along with the empirical frequencies (black).
figures/ep/lmer-drunk-fixed.png

Based on this plot, we might say that drunk stories are either especially heart-wrenching or especially funny. Both our empirical estimate and the fitted model agree on this. However, we can see from our earlier breakdown that this isn't an "either/or" choice. Rather, it is conditioned in part by age. The hierarchical estimates bring this out well. To extract them:

  1. hier = coef(fit.lmer)$Age
  2. hier
  3. Categoryhugs Categoryrock Categoryteehee Categoryunderstand Categorywow
  4. 1 -7.049464 -7.802812 -7.308462 -7.719356 -6.999898
  5. 2 -6.910581 -6.842171 -6.462696 -7.293635 -6.755380
  6. 3 -6.793853 -7.078609 -6.375680 -7.324456 -6.595666
  7. 4 -7.073637 -7.248539 -7.024493 -7.524842 -7.126939
  8. 5 -7.286526 -7.356749 -7.505571 -7.669464 -7.374190
  9. ## Let's map them to frequencies too:
  10. hier = plogis(as.matrix(hier))
  11. hier
  12. Categoryhugs Categoryrock Categoryteehee Categoryunderstand Categorywow
  13. 1 0.0008671216 0.0004084175 0.0006693980 0.0004439494 0.0009111442
  14. 2 0.0009961851 0.0010666443 0.0015581505 0.0006793903 0.0011632426
  15. 3 0.0011193884 0.0008422345 0.0016995674 0.0006587838 0.0013644127
  16. 4 0.0008464290 0.0007107076 0.0008890272 0.0005392229 0.0008025299
  17. 5 0.0006842344 0.0006378633 0.0005497091 0.0004666501 0.0006268412

We can again plot these values and compare them to their empirical estimates:

  1. par(mfrow=c(2,3))
  2. for (i in 1:5) {
  3.   epPlot(ep, eptok, 'drunk', ages=i)
  4.   points(seq(1,5), plogis(fixef(fit.lmer)), col='green', pch=19, lwd=5)
  5.   points(seq(1,5), hier[i, ], col='red', pch=19, lwd=5)
  6. }
Figure ep:lmer-drunk
Comparing the empirical values (black) with fixed effects estimates (green) and hierarchical estimates (red) for drunk by Age.
figures/ep/lmer-drunk-hier.png
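
For downstream use, we want exactly the back-off behavior described above: hierarchical estimates where the author's age is known, fixed-effects estimates where it isn't. Here is a small helper along those lines (the function is my own sketch, not part of ep.R):

  1. ## A hypothetical helper; 'fit' is the hierarchical fit from above:
  2. drunkEstimates = function(fit, age=NULL) {
  3.   ## No age information: back off to the fixed-effects estimates:
  4.   if (is.null(age)) { return(plogis(fixef(fit))) }
  5.   ## Otherwise use the hierarchical estimates for that age level:
  6.   return(plogis(as.matrix(coef(fit)$Age))[as.character(age), ])
  7. }
  8. drunkEstimates(fit.lmer, 2)  ## estimates for 20s authors
  9. drunkEstimates(fit.lmer)     ## age unknown: fixed-effects estimates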

Exercise ep:adapt Adapt the above code so that you can run additional case studies. Really, this just involves turning the model fitting steps into a function, and perhaps writing a few versions of FittedGlmFunc for different contextual variables.

Context-sensitive lexicons

The above case study suggests a general method for building context-sensitive lexicons. Here is the procedure, focusing on sensitivity to author age (a sketch of step 1 appears after the list):

  1. For each word w in the vocabulary, fit an lmer model for w with Category as the fixed-effects predictor and Age as the hierarchical predictor.
  2. Optionally filter using the p-values to help avoid giving sentiment scores where there isn't sufficient evidence to support them.
  3. Use the coefficients from the model as sentiment scores: where age information is available, use the hierarchical estimates, else use the fixed effects estimates.
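
Here is a sketch of step 1 in code, following the pattern of epLexicon above. The function is mine, not part of ep.R; I omit the p-value filtering of step 2 and assume the same binomial specification as in the case study:

  1. epHierLexicon = function(pf) {
  2.   ## Drop unknown ages, as in the case study, and make Age numeric:
  3.   pf = subset(pf, Age != 'unknown')
  4.   pf$Age = as.numeric(pf$Age)
  5.   fit = glmer(cbind(Count,Total-Count) ~ Category - 1 + (Category - 1 | Age), data=pf, family=binomial)
  6.   ## One row of frequency-scale estimates per age level:
  7.   scores = plogis(as.matrix(coef(fit)$Age))
  8.   return(data.frame(Age=rownames(scores), scores, row.names=NULL))
  9. }
  10. ## Applied word by word, e.g.: epHierLexicon(epFullFrame(ep, 'drunk', ages=c(1,2,3,4,5)))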

This is very computationally intensive, as you probably gathered if you waited around for your own lmer models to converge. However, it can be parallelized if necessary, and it need only be done once.
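
For instance, plyr can farm the per-word fits out to multiple cores through the foreach system. A sketch, assuming epHierLexicon from above (doMC relies on forking, so this particular backend won't work on Windows):

  1. ## Register a parallel backend (doMC is not part of the course code):
  2. library(doMC)
  3. registerDoMC(cores=4)
  4. ## Then hand the per-word work to all of the cores:
  5. hierlex = ddply(ep, .variables=c('Word'), .fun=epHierLexicon, .parallel=TRUE)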

Exercise ex:larger The small vocabulary of ep3-context.csv might start to feel confining after a little while. The file epconfessions-unigrams.csv in the code distribution for my SALT 20 paper is compatible with the above functions. It doesn't have the contextual features, but it has a truly massive vocabulary: potts-salt20-data-and-code.zip. The distribution doesn't contain a token-counts file, but you can download one here and treat it like eptok above: epconfessions-tokencounts-unigrams.csv.zip.