Thumbs Up? Sentiment Classification Like it’s 2002


May 25, 2023


In July 2002 Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan published Thumbs up? Sentiment Classification using Machine Learning Techniques at EMNLP, one of the earliest works to use machine learning for sentiment classification. It was an influential paper, winning a test of time award at NAACL 2018, and at the time of writing has over 11,000 citations. This work led to their follow-up, Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, and this dataset was the basis for the Stanford Sentiment Treebank dataset released in Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank by Socher et al., which is widely used partly because of its inclusion in GLUE.

This paper aims to show that classifying the sentiment of movie reviews is a more challenging problem for machine learning techniques than the existing topic classification problems, and to motivate further work (in which they were successful!). They do this by building a self-labelled dataset of polar movie reviews from Usenet and then showing that baseline classifiers don’t work as well on it as they do on existing topic classification datasets.

This notebook aims to explore the paper and its methods in more detail, and the headings follow the paper section by section. We go much deeper into the data than the paper does, reproduce their methods, and get similar (but slightly better) results. A good direction for future work would be to apply more modern methods to this dataset.

The Movie Review Domain

They took reviews from the Internet Movie Database (IMDb) archive of a Usenet newsgroup, kept the reviews with a numerical or star rating, labelled the highest-scored ones positive and the lowest-scored ones negative, and removed the rest.

The IMDb archive no longer exists, but there are current archives of this newsgroup in Google Groups and the Usenet Archives. Thankfully the authors released their original data: both the raw HTML they extracted and the extracted text they used for classification.

Let’s take a look at the HTML to see what they worked with

from urllib.request import urlretrieve
from pathlib import Path
from zipfile import ZipFile
import tarfile
import re

data_dir = Path('data')
data_dir.mkdir(exist_ok=True)  # make sure the download directory exists

source_html_url = ''

raw_html_path = data_dir / ''
if not raw_html_path.exists():
    urlretrieve(source_html_url, raw_html_path)
raw_html_zip = ZipFile(raw_html_path)

The zipfile contains a single directory movie containing around 27k review files

[<ZipInfo filename='movie/0002.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=4415 compress_size=2170>,
 <ZipInfo filename='movie/0003.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=2702 compress_size=1398>,
 <ZipInfo filename='movie/0004.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=6165 compress_size=3059>,
 <ZipInfo filename='movie/0005.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=4427 compress_size=2103>,
 <ZipInfo filename='movie/0006.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=6423 compress_size=3225>]
[<ZipInfo filename='movie/9995.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=5232 compress_size=2643>,
 <ZipInfo filename='movie/9997.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=10113 compress_size=4812>,
 <ZipInfo filename='movie/9998.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=3868 compress_size=1935>,
 <ZipInfo filename='movie/9999.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=3081 compress_size=1605>,
 <ZipInfo filename='movie/' filemode='drwxrwxrwx' external_attr=0x10>]

Let’s have a look at one of them (that’s not too long, and recent enough to be in other archives); you could also see it on the Usenet Archives or Google Groups.

Note that the original was almost certainly a plaintext email; some of the HTML markup (in particular the footer) would have been added by IMDb. Note that the rating is stated twice in the review as a “low 0” on a scale from -4 to 4; this begins to indicate the difficulty of automatically extracting the ratings.
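To give a feel for why rating extraction is fiddly, here is a minimal sketch that handles only two rating formats that appear in reviews shown in this notebook. The function name `extract_rating` and both regular expressions are my own illustrative assumptions, not the authors' extraction code. It returns the rating rescaled to the range 0 to 1, or None when neither pattern matches.

```python
import re

# Hypothetical patterns covering two formats only; real reviews use many more.
SCALE_PATTERN = re.compile(
    r'(?P<value>[-+]?\d+)\s+on\s+the\s+(?P<low>[-+]?\d+)\s+to\s+(?P<high>[-+]?\d+)\s+scale',
    re.IGNORECASE)
STARS_PATTERN = re.compile(r'out\s+of\s+(?P<max>\d+)\)?:?\s*(?P<stars>\*+)', re.IGNORECASE)


def extract_rating(text):
    """Return a rating rescaled to [0, 1], or None if no known format matches."""
    match = SCALE_PATTERN.search(text)
    if match:
        value, low, high = (int(match.group(k)) for k in ('value', 'low', 'high'))
        return (value - low) / (high - low)
    match = STARS_PATTERN.search(text)
    if match:
        return len(match.group('stars')) / int(match.group('max'))
    return None


print(extract_rating('My rating is a low 0 on the -4 to +4 scale.'))  # 0.5
print(extract_rating('MY RATING (out of 4): ***'))                    # 0.75
print(extract_rating('Rating: low 0.'))                               # None
```

Even on these examples the sketch loses information: the “low” qualifier is dropped, and the capsule summary “Rating: low 0.” cannot be interpreted at all without knowing the author's scale.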

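CODEC = 'ISO-8859-1'
# Read one review straight out of the zip archive and decode it
movie_review_html = raw_html_zip.read('movie/0908.html').decode(CODEC)
print(movie_review_html)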
<TITLE>Review for Flight of the Intruder (1990)</TITLE>
<LINK REL="STYLESHEET" TYPE="text/css" HREF="/ramr.css">
<H1 ALIGN="CENTER" CLASS="title"><A HREF="/Title?0099587">Flight of the Intruder (1990)</A></H1><H3 ALIGN=CENTER>reviewed by<BR><A HREF="/ReviewsBy?Mark+R.+Leeper">Mark R. Leeper</A></H3><HR WIDTH="40%" SIZE="4">
<PRE>                            FLIGHT OF THE INTRUDER
                       A film review by Mark R. Leeper
                        Copyright 1991 Mark R. Leeper</PRE>
<P>          Capsule review:  Pretty pictures, stupid story.  The
     air-war of a previous conflict is occasionally entertaining
     to watch but the plot is cliched as are most of the
     characters.  This film's only chance is to follow the current
     wave of interest in military equipment.  Rating: low 0.</P>
<P>     Had I not actually seen a copy of the book FLIGHT OF THE INTRUDER by
Stephen Coonts, I would have had a hard time telling if this was a very weak
story given classy military equipment photography and quality special
effects treatment or if this was just a collection of classy military
equipment photography and quality special effects tied together by a very
weak excuse for a story.  During World War II a lot of B war movies carried
stories just as good to the bottom half of double bills.  We are talking
HELLCATS OF THE NAVY-level plotting here.  In 1972 Vietnam we have an
aircraft carrier ruled over by a cigar-chewing, mean-as-a-junkyard-dog-but-
heart-of-gold sort of commander.  Danny Glover plays the Black commander
with the unlikely name Frank Camparelli.  One of his bright young pilots,
Jake Grafton (played by the uninteresting Brad Johnson) agonizes over the
loss of his bombardier.  The companion is lost in a raid that accomplishes
nothing besides adding visual interest to the opening credits.  Grafton
wants to go on a super-special raid of his own devising.  But this raid is
directly contrary to orders.  His top-gun replacement bombardier Virgil Cole
(played by Willem Dafoe) says absolutely not.  Does Jake get to make his
super-special raid on North Vietnam?  And if he does, what is the Navy's
<P>     The weak story is, however, punctuated by pretty pictures of planes,
helicopters, and aircraft carriers to keep the audience watching.  If this
film stands any chance with audiences it is in the fortuitous timing of this
film coincident with a sudden upsurge of interest in technical weaponry.
Indeed many people may find events in the Middle East resonating with
attitudes in this film.  On the other hand, maybe some people would prefer
to stay home and watch technical weaponry on television.</P>
<P>     FLIGHT OF THE INTRUDER is directed by John Milius, who is specializing
in gutsy films like APOCALYPSE NOW (which he wrote), CONAN THE BARBARIAN,
and RED DAWN.  The score is by Basil Poledouris, the gifted composer of the
scores for the "Conan" films, who seems repeatedly associated with films
with right-wing themes.  Poledouris scored RED DAWN, AMERIKA, and THE HUNT
<P>     FLIGHT OF THE INTRUDER is linked in advertising with THE HUNT FOR RED
OCTOBER, but it falls well short of that film's interest value and quality.
My rating is a low 0 on the -4 to +4 scale.</P>
<PRE>                                        Mark R. Leeper
                                        <A HREF=""></A>
<HR><P CLASS=flush><SMALL>The review above was posted to the
<A HREF=""></A> newsgroup (<A HREF=""></A> for German reviews).<BR>
The Internet Movie Database accepts no responsibility for the contents of the
review and has no editorial control. Unless stated otherwise, the copyright
belongs to the author.<BR>
Please direct comments/criticisms of the review to relevant newsgroups.<BR>
Broken URLs inthe reviews are the responsibility of the author.<BR>
The formatting of the review is likely to differ from the original due
to ASCII to HTML conversion.
<P ALIGN=CENTER>Related links: <A HREF="/Reviews/">index of all reviews</A></P>


For convenience let’s define a function to read the HTML of a given movie

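def get_movie_html(movieid):
    # movieid is the zero-padded id from the filename, e.g. '0908'
    return raw_html_zip.read(f'movie/{movieid}.html').decode(CODEC)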
<TITLE>Review for Flight of the Intruder (1990)</TITLE>
<LINK REL="STYLESHEET" TYPE="text/css" HREF="/ramr.css">
<H1 ALIGN="CENTER" CLASS="title"><A HREF="/Title?0099587">Flight of the Intruder (1990)</A></H1><H3 ALIGN=CENTER>reviewed by<BR><A HREF="/ReviewsBy?Mark+R.+Leeper">Mark R. Leeper</A></H3><HR WIDTH="40%" SIZE="4">
<PRE>                            FLIGHT OF THE INTRUDER
                       A film review by Mark R. Leeper
                        Copyright 1991 Mark R. Leeper</PRE>
<P>          Capsule review:  Pretty pictures, stupid story.  The
     air-war of a previous conflict is occasionally entertaining
     to watch but the plot is cliched as are most of the
     characters.  This film's only chance is to follow the current
     wave of interest in military equipment.  Rating: low 0.</P>
<P>     Had I not actually seen a copy of the book FLIGHT OF THE INTRUDER by
Stephen Coonts, I would have h

Cleaned Text

For the cleaned and labelled text we’ll get version 1.1, which according to the README has some corrections over the version used in the paper.

sentiment_url = ''
sentiment_path = data_dir / 'sentiment.tar.gz'

if not sentiment_path.exists():
    urlretrieve(sentiment_url, sentiment_path)

sentiment_fh = tarfile.open(sentiment_path)


The archive contains:

  • a diff.txt that says what changed between versions
  • a README describing the dataset
  • a tokens/ directory with subfolders neg/ and pos/ containing negative and positive reviews

We can extract the label (pos or neg), the cross-validation id (cvid), and the movie id from the filename with a regular expression

pattern = re.compile(r'^tokens/(?P<label>[^/]+)/cv(?P<cvid>[0-9]+)_tok-(?P<movieid>[0-9]+)\.txt$')

{'label': 'neg', 'cvid': '303', 'movieid': '11557'}
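As a quick check of how the named groups behave, here is a small self-contained sketch. The filename is reconstructed from the output above (an assumption about the exact member name), and the dot before `txt` is escaped so that it only matches a literal dot.

```python
import re

pattern = re.compile(
    r'^tokens/(?P<label>[^/]+)/cv(?P<cvid>[0-9]+)_tok-(?P<movieid>[0-9]+)\.txt$')

# A review file matches and yields its three named groups.
match = pattern.match('tokens/neg/cv303_tok-11557.txt')
print(match.groupdict())
# {'label': 'neg', 'cvid': '303', 'movieid': '11557'}

# Non-review members (README, directories) do not match at all.
print(pattern.match('README'))      # None
print(pattern.match('tokens/neg'))  # None
```

Note that every group is captured as a string, so the cvid has to be converted with int() before doing arithmetic on it.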

Let’s extract all the data into aligned lists

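data = {
    'label': [],
    'cvid': [],
    'movieid': [],
    'text': []
}

for member in sentiment_fh:
    match = pattern.match(member.name)
    if not match:
        # Not a review file (README, diff.txt, directories)
        print('Skipping %s' % member.name)
        continue
    for k, v in match.groupdict().items():
        data[k].append(v)
    # Read the review text, decoded the same way as the other files here
    data['text'].append(sentiment_fh.extractfile(member).read().decode(CODEC))

{k: len(v) for k,v in data.items()}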
Skipping diff.txt
Skipping README
Skipping tokens
Skipping tokens/neg
Skipping tokens/pos
{'label': 1386, 'cvid': 1386, 'movieid': 1386, 'text': 1386}
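The same loop can be tried without downloading anything. The sketch below builds a tiny stand-in archive in memory (the file names and contents are made up) and walks it the same way: skip members that don't match the pattern, otherwise record the named groups and the decoded text.

```python
import io
import re
import tarfile

PATTERN = re.compile(
    r'^tokens/(?P<label>[^/]+)/cv(?P<cvid>[0-9]+)_tok-(?P<movieid>[0-9]+)\.txt$')

# Build a tiny stand-in archive in memory (made-up file contents).
files = {
    'README': b'toy readme',
    'tokens/neg/cv000_tok-0001.txt': b'a bad , boring film .',
    'tokens/pos/cv000_tok-0002.txt': b'a great film !',
}
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode='w:gz') as archive:
    for name, content in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(content)
        archive.addfile(info, io.BytesIO(content))
buffer.seek(0)

toy = {'label': [], 'cvid': [], 'movieid': [], 'text': []}
with tarfile.open(fileobj=buffer, mode='r:gz') as archive:
    for member in archive:
        match = PATTERN.match(member.name)
        if not match:
            continue  # README and anything else that is not a review
        for key, value in match.groupdict().items():
            toy[key].append(value)
        toy['text'].append(archive.extractfile(member).read().decode('ISO-8859-1'))

print(toy['label'])    # ['neg', 'pos']
print(toy['movieid'])  # ['0001', '0002']
```

Checking the pattern before calling extractfile matters, because extractfile returns None for directory members.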

To avoid domination of the corpus by a small number of prolific reviewers, we imposed a limit of fewer than 20 reviews per author per sentiment category, yielding a corpus of 752 negative and 1301 positive reviews, with a total of 144 reviewers represented.

we randomly selected 700 positive-sentiment and 700 negative-sentiment documents

We get slightly fewer than 700 of each, likely due to the corrections made since the dataset was first released

from collections import Counter

Counter(data['label'])

Counter({'neg': 692, 'pos': 694})

Data Lengths

These reviews can be quite long, and the tokenization of punctuation is quite aggressive; how many tokens do the reviews actually contain?

lengths = [len(text.split()) for text in data['text']]
def median(l):
    return sorted(l)[len(l)//2]

def mean(l):
    return sum(l)/len(l)

The median length of reviews is around 700 tokens

median(lengths), mean(lengths)
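One small caveat about the hand-rolled median: for a list with an even number of items it returns the upper of the two middle values rather than their average. The sketch below compares it with the standard library on made-up lists; with well over a thousand reviews the difference is negligible here.

```python
import statistics


def median(l):
    # Same definition as above: the element at index len(l) // 2 after sorting.
    return sorted(l)[len(l) // 2]


odd = [3, 1, 2]
even = [4, 1, 3, 2]

print(median(odd), statistics.median(odd))    # 2 2
print(median(even), statistics.median(even))  # 3 2.5
print(statistics.median_high(even))           # 3
```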

Note that negative reviews tend to be a little shorter than positive reviews.

neg_lengths = [x for x, l in zip(lengths, data['label']) if l == 'neg']
median(neg_lengths), mean(neg_lengths)
(681, 715.5664739884393)
pos_lengths = [x for x, l in zip(lengths, data['label']) if l == 'pos']
median(pos_lengths), mean(pos_lengths)
(735, 798.0345821325649)

We can use this to get a few percentage points above chance using review length alone, without looking at the words at all

preds = ['pos' if l > 730 else 'neg' for l in lengths]
def accuracy(preds, actuals):
    if len(preds) != len(actuals):
        raise ValueError('Expected same length input')
    return mean([p==a for p,a in zip(preds, actuals)])

accuracy(preds, data['label'])
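The cutoff of 730 was picked by eye from the medians. A minimal sketch of choosing it by brute force instead is below; the helper `best_length_threshold` and the toy data are my own illustration, not something from the paper. Since the threshold is tuned on the same data it is scored on, the resulting accuracy is optimistic.

```python
def accuracy(preds, actuals):
    return sum(p == a for p, a in zip(preds, actuals)) / len(actuals)


def best_length_threshold(lengths, labels):
    """Try each observed length as a cutoff and return (threshold, accuracy)."""
    best = (None, 0.0)
    for threshold in sorted(set(lengths)):
        preds = ['pos' if n > threshold else 'neg' for n in lengths]
        score = accuracy(preds, labels)
        if score > best[1]:
            best = (threshold, score)
    return best


# Made-up lengths and labels, purely for illustration.
toy_lengths = [500, 600, 650, 700, 750, 800, 900, 1000]
toy_labels = ['neg', 'neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'pos']

print(best_length_threshold(toy_lengths, toy_labels))  # (600, 0.75)
```

On the real data the same call would be `best_length_threshold(lengths, data['label'])`.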

A Closer Look at the Problem

From the paper they set a benchmark using manual lists of positive and negative words:

One might also suspect that there are certain words people tend to use to express strong sentiments, so that it might suffice to simply produce a list of such words by introspection and rely on them alone to classify the texts.

To test this latter hypothesis, we asked two graduate students in computer science to (independently) choose good indicator words for positive and negative sentiments in movie reviews.

Extracting these from Figure 1

positive1 = 'dazzling, brilliant, phenomenal, excellent, fantastic'.split(', ')
['dazzling', 'brilliant', 'phenomenal', 'excellent', 'fantastic']
negative1 = 'suck, terrible, awful, unwatchable, hideous'.split(', ')
['suck', 'terrible', 'awful', 'unwatchable', 'hideous']
positive2 = 'gripping, mesmerizing, riveting, spectacular, cool, ' \
            'awesome, thrilling, badass, excellent, moving, exciting'.split(', ')
negative2 = 'bad, cliched, sucks, boring, stupid, slow'.split(', ')
['bad', 'cliched', 'sucks', 'boring', 'stupid', 'slow']

We then converted their responses into simple decision procedures that essentially count the number of the proposed positive and negative words in a given document.

We can build a small classifier to do this. When there’s a tie we need to decide how to break it with a “default”.

From the paper

Note that the tie rates — percentage of documents where the two sentiments were rated equally likely — are quite high (we chose a tie breaking policy that maximized the accuracy of the baselines)

We need to look at the data for this.

idx2cat = ['neg', 'pos']
cat2idx = {'neg': 0, 'pos': 1}
class NotFittedException(Exception):
    pass

class MatchCountClassifier:
    def __init__(self, positive, negative):
        self.positive = positive
        self.negative = negative
        self.default = None
        self.ties = None

    def _score(self, tokens):
        """Return number of positive words - number of negative words in tokens"""
        pos_count = len([t for t in tokens if t in self.positive])
        neg_count = len([t for t in tokens if t in self.negative])
        return pos_count - neg_count

    def fit(self, X, y):
        """Find the tie-breaking default that maximises accuracy on (X, y)"""
        scores = [self._score(tokens) for tokens in X]
        self.ties = len([x for x in scores if x==0]) / len(scores)
        # Predictions when ties (score == 0) are broken as positive or as negative
        pred_pos_default = [1 if x >= 0 else 0 for x in scores]
        pred_neg_default = [0 if x <= 0 else 1 for x in scores]
        if accuracy(pred_pos_default, y) >= accuracy(pred_neg_default, y):
            self.default = 1
        else:
            self.default = 0
        return self

    def predict(self, X):
        if self.default is None:
            raise NotFittedException()
        scores = [self._score(tokens) for tokens in X]
        return [1 if score > 0 else 0 if score < 0 else self.default for score in scores]

Let’s test our class

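mcc_test = MatchCountClassifier(['happy'], ['sad'])
X_test = [['happy'], ['sad'], ['happy', 'sad']]
y_test_1 = [1, 0, 1]
y_test_2 = [1, 0, 0]

# The tied document is positive here, so ties should default to positive
mcc_test.fit(X_test, y_test_1)
assert mcc_test.default == 1
assert mcc_test.predict([['sad'], []]) == [0, 1]
assert mcc_test.ties == 1/3

# The tied document is negative here, so ties should default to negative
mcc_test.fit(X_test, y_test_2)
assert mcc_test.default == 0
assert mcc_test.predict([['sad'], []]) == [0, 0]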

Human Baselines

Let’s get the tokens and labels

X = [text.split() for text in data['text']]
y = [cat2idx[l] for l in data['label']]

The ties and accuracy match Table 1 for Human 1

mcc1 = MatchCountClassifier(positive1, negative1).fit(X, y)

print(f'''Human 1
Ties: {mcc1.ties:0.0%}
Accuracy: {accuracy(mcc1.predict(X), y):0.0%}''')
Human 1
Ties: 75%
Accuracy: 56%

For Human 2 we get accuracy 1 percentage point higher than the paper; likely due to corrections to the dataset.

mcc2 = MatchCountClassifier(positive2, negative2).fit(X, y)

print(f'''Human 2
Ties: {mcc2.ties:0.0%}
Accuracy: {accuracy(mcc2.predict(X), y):0.0%}''')
Human 2
Ties: 39%
Accuracy: 65%

They also provide a third baseline in Table 2, using statistics from the dataset:

Based on a very preliminary examination of frequency counts in the entire corpus (including test data) plus introspection, we created a list of seven positive and seven negative words (including punctuation), shown in Figure 2.

positive3 = 'love, wonderful, best, great, superb, still, beautiful'.split(', ')
['love', 'wonderful', 'best', 'great', 'superb', 'still', 'beautiful']
negative3 = 'bad, worst, stupid, waste, boring, ?, !'.split(', ')
['bad', 'worst', 'stupid', 'waste', 'boring', '?', '!']

Again we get an accuracy 1 percentage point higher than the paper

mcc3 = MatchCountClassifier(positive3, negative3).fit(X, y)

print(f'''Human 3 + stats
Ties: {mcc3.ties:0.0%}
Accuracy: {accuracy(mcc3.predict(X), y):0.0%}''')
Human 3 + stats
Ties: 15%
Accuracy: 70%
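The tie rate explains most of the gap between these three baselines. As a rough back-of-the-envelope sketch, assume tied documents are classified correctly about half the time (an assumption that seems plausible because the classes are almost perfectly balanced, and not a measured number). Then the accuracy on the documents where the word lists actually decide follows from overall = ties × 0.5 + (1 − ties) × decided. The helper name `decided_accuracy` is mine.

```python
def decided_accuracy(overall_accuracy, tie_rate, tie_accuracy=0.5):
    """Accuracy on non-tied documents implied by the overall numbers.

    Solves overall = tie_rate * tie_accuracy + (1 - tie_rate) * decided
    for `decided`. The default tie_accuracy of 0.5 is an assumption.
    """
    return (overall_accuracy - tie_rate * tie_accuracy) / (1 - tie_rate)


# Rounded figures reported above for each word list: (accuracy, tie rate).
reported = {
    'Human 1': (0.56, 0.75),
    'Human 2': (0.65, 0.39),
    'Human 3 + stats': (0.70, 0.15),
}
for name, (acc, ties) in reported.items():
    print(f'{name}: {decided_accuracy(acc, ties):.2f}')
# Human 1: 0.74
# Human 2: 0.75
# Human 3 + stats: 0.74
```

Under that assumption all three lists are right roughly three times out of four when they do make a call; the longer lists mostly help by making a call more often. The inputs are rounded percentages, so treat the second decimal place loosely.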

Could we do better?

An obvious strategy would be to combine the lists; they are mostly disjoint.

However that doesn’t improve our accuracy at all over Human 3

mcc_all = MatchCountClassifier(set(positive1 + positive2 + positive3),
                               set(negative1 + negative2 + negative3)).fit(X, y)

print(f'''Combined Humans
Ties: {mcc_all.ties:0.0%}
Accuracy: {accuracy(mcc_all.predict(X), y):0.0%}''')
Combined Humans
Ties: 13%
Accuracy: 70%
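The claim that the lists are mostly disjoint is easy to check with set intersections. The sketch below restates the six word lists so that it runs on its own.

```python
from itertools import combinations

positive_lists = {
    'positive1': 'dazzling, brilliant, phenomenal, excellent, fantastic'.split(', '),
    'positive2': ('gripping, mesmerizing, riveting, spectacular, cool, '
                  'awesome, thrilling, badass, excellent, moving, exciting').split(', '),
    'positive3': 'love, wonderful, best, great, superb, still, beautiful'.split(', '),
}
negative_lists = {
    'negative1': 'suck, terrible, awful, unwatchable, hideous'.split(', '),
    'negative2': 'bad, cliched, sucks, boring, stupid, slow'.split(', '),
    'negative3': 'bad, worst, stupid, waste, boring, ?, !'.split(', '),
}

# Print the shared words for every pair of lists with the same polarity.
for lists in (positive_lists, negative_lists):
    for (name_a, a), (name_b, b) in combinations(lists.items(), 2):
        print(name_a, name_b, sorted(set(a) & set(b)))
# positive1 positive2 ['excellent']
# positive1 positive3 []
# positive2 positive3 []
# negative1 negative2 []
# negative1 negative3 []
# negative2 negative3 ['bad', 'boring', 'stupid']
```

Note that 'suck' and 'sucks' count as different words because the classifier matches exact tokens.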

Another resource that was available at the time was the Harvard General Inquirer lexicon, which tags words with a Positiv or Negativ sentiment category, among many other classifications.

I can’t find an official source for the lexicon, but there’s a version inside the pysentiment library (which may be different to what was available at the time).

import csv

harvard_inquirer_url = ''
harvard_inquirer_path = data_dir / 'HIV-4.csv'
if not harvard_inquirer_path.exists():
    urlretrieve(harvard_inquirer_url, harvard_inquirer_path)

with open(harvard_inquirer_path) as f:
    harvard_inquirer_data = list(csv.DictReader(f))

We can extract the positive and negative entries

positive_hi = [i['Entry'].lower() for i in harvard_inquirer_data if i['Positiv']]
positive_hi[:5], positive_hi[-5:], len(positive_hi)
(['abide', 'ability', 'able', 'abound', 'absolve'],
 ['worth-while', 'worthiness', 'worthy', 'zenith', 'zest'],
negative_hi = [i['Entry'].lower() for i in harvard_inquirer_data if i['Negativ']]
negative_hi[:5], negative_hi[-5:], len(negative_hi)
(['abandon', 'abandonment', 'abate', 'abdicate', 'abhor'],
 ['wrongful', 'wrought', 'yawn', 'yearn', 'yelp'],

This has fewer ties but actually a lower accuracy than Human 3.

(Technically we should use stemming with Harvard Inquirer but it won’t improve matters here)

mcc_hi = MatchCountClassifier(set(positive_hi), set(negative_hi)).fit(X, y)

print(f'''Harvard Inquirer
Ties: {mcc_hi.ties:0.0%}
Accuracy: {accuracy(mcc_hi.predict(X), y):0.0%}''')
Harvard Inquirer
Ties: 6%
Accuracy: 63%

Error Analysis

yhat = mcc3.predict(X)
scores = [mcc3._score(row) for row in X]
correct = [yi==yhati for yi, yhati in zip(y, yhat)]
incorrect_idx = [i for i, c in enumerate(correct) if not c]
for score, count in Counter(scores[i] for i in incorrect_idx).most_common():
    print(score, '\t', count, '\t', f'{count/len(incorrect_idx):0.2%}')
  0     84    20.10%
 -1     67    16.03%
  1     61    14.59%
 -2     51    12.20%
  2     37     8.85%
 -3     30     7.18%
 -4     16     3.83%
  3     13     3.11%
 -5     11     2.63%
 -6     10     2.39%
  4      8     1.91%
  5      7     1.67%
  6      6     1.44%
 -9      4     0.96%
 -7      4     0.96%
-10      3     0.72%
 -8      2     0.48%
 17      1     0.24%
-15      1     0.24%
-12      1     0.24%
-11      1     0.24%

Let’s look at the most extreme cases, where the score is off by 7 or more

very_wrong_idx = [i for i in incorrect_idx if abs(scores[i]) >= 7]

import html

def mark_span(text, color):
    return f'<span style="background: {color};">{html.escape(text)}</span>'

def markup_html_words(text, words, color):
    # Single-character "words" such as ? and ! get no \b word boundaries
    word_pattern = '|'.join([r'\b' + re.escape(word) + r'\b' if len(word) > 1 else re.escape(word) for word in words])
    # The negative lookahead (?![^<]*>) skips matches that fall inside an HTML tag
    return re.sub(fr"({word_pattern})(?![^<]*>)", lambda match: mark_span(match.group(1), color), text, flags=re.IGNORECASE)

def markup_sentiment(text, positive=positive3, negative=negative3):
    text = markup_html_words(text, positive, "lightgreen")
    text = markup_html_words(text, negative, "orange")
    return text
from IPython.display import HTML, display

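def show_index(idx):
    movieid = data['movieid'][idx]
    print(f'Movie: {movieid}, Label: {data["label"][idx]}, Score: {scores[idx]}')
    # Render the original review with the matched sentiment words highlighted
    display(HTML(markup_sentiment(get_movie_html(movieid))))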
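To see what the highlighting regex does on its own, here is a small sketch on a made-up HTML snippet. The two helpers are restated so the block runs standalone; the snippet and the word lists passed in are purely illustrative.

```python
import html
import re


def mark_span(text, color):
    return f'<span style="background: {color};">{html.escape(text)}</span>'


def markup_html_words(text, words, color):
    word_pattern = '|'.join(
        r'\b' + re.escape(word) + r'\b' if len(word) > 1 else re.escape(word)
        for word in words)
    # The negative lookahead skips matches that sit inside an HTML tag.
    return re.sub(fr'({word_pattern})(?![^<]*>)',
                  lambda match: mark_span(match.group(1), color),
                  text, flags=re.IGNORECASE)


snippet = '<A HREF="/great.html">A GREAT film</A>, not greatly flawed!'
print(markup_html_words(snippet, ['great'], 'lightgreen'))
print(markup_html_words(snippet, ['!'], 'orange'))
```

Three things are worth noticing in the output: the “great” inside the HREF attribute is left alone, the match is case-insensitive so “GREAT” is highlighted, and “greatly” is not matched because of the word boundaries.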

This review is labelled as negative, despite giving the film 3 out of 4 stars.

The author is a Woody Allen fan and it wasn’t his favorite Woody Allen film but it’s still pretty good.

This is a mislabelling.

Movie: 15970, Label: neg, Score: 17
Review for Celebrity (1998)

Celebrity (1998)

reviewed by
Matt Prigge

A Film Review by Ted Prigge
Copyright 1998 Ted Prigge

Writer/Director: Woody Allen Starring: Kenneth Branagh, Judy Davis, Joe Mantegna, Charlize Theron, Leonardo DiCaprio, Famke Janssen, Winona Ryder, Melanie Griffith, Bebe Neuwirth, Michael Lerner, Hank Azaria, Gretchen Mol, Dylan Baker, Jeffrey Wright, Greg Mottola, Andre Gregory, Saffron Burrows, Alfred Molina, Vanessa Redgrave, Joey Buttafuoco, Mary Jo Buttafuoco, Donald Trump

After hearing reviews for Woody Allen's upteenth movie in history, "Celebrity," range from terribly boring to just so-so, my heart lept when the opening images of the film closely resembled that of "Manhattan," my personal favorite from my personal favorite director of all time. Woody Allen's films almost never rely on visual flair over textual flair, so when one of his films closely resembles the one time that these two entities fit hand-in-hand ("Manhattan" really is one of the best-looking films I've ever seen, beautiful black and white photography of the city's best areas, etc.), a fan can't help but feel visibly moved. The film opens up, with the usual credits with plain white font over black backgrounds, and an old ironic standard playing on the soundtrack, but then the screen fills with a gorgeous dull gray sky, with the word "Help" being spelled with an airplane. Beethoven's 5th blasts on the soundtrack. The city seems to stop to take notice of this moment, and it's all rather lovely to look at.

And then we cut to a film crew, shooting this as the film's hilariously banal key moment in the film, where the lead actress in the film (Melanie Griffith, looking as buxom and beautiful as ever) has to realize something's wrong with her life or whatever. It's a terribly stale scene for a Woody Allen film, with the great opening shots or without, and my heart sank and I soon got used to the fact that once again, a new film of his was not going to be as great as his past works (though, for the record, last year's "Deconstructing Harry" came awfully close).

What the hell has happened to him? The man who once could be relied on for neurotic freshness in cinema has not become less funny, but his films have become less insightful and more like he tossed them together out of unfinished ideas. "Bullets Over Broadway," though wonderful, relies on irony to pull a farce that just never totally takes off. "Mighty Aphrodite" is more full of great moments and lines than a really great story. "Everyone Says I Love You" was more of a great idea than a great film. Even "Deconstructing Harry" is admittingly cheap in a way, even if it does top as one of his most truly hilarious films.

If anything, the reception of "Celebrity" by everyone should tip Allen off to the fact that this time, it's not the audience and critics who are wrong about how wonderful his film is: it's him. "Celebrity" is, yes, a good film, but it's only marginally satisfying as a Woody Allen film. Instead of creating the great Woody Allen world, he's created a world out of a subject he knows only a bit about. And he's fashioned a film that is based almost entirely on his uninformed philosophy of celebrities, so that it plays like a series of skits with minor connections. It's like "La Dolce Vita" without the accuracy, the right amount of wit, and the correct personal crisis.

Woody, becoming more insecure in his old age, choses to drop the Woody Allen character in on the world of celebrities, and then hang him and all his flaws up for scrutiny, and does this by casting not himself but Brit actor Kenneth Branagh in the lead. Much has been said about his performance - dead on but irritating, makes one yearn for the real thing, blah blah blah - but to anyone who actually knows the Woody Allen character knows that Branagh's performance, though featuring some of the same mannerisms (stuttering, whining, lots o' hand gestures), is hardly a warts-and-all impersonation. Branagh brings along with him little of the Woody Allen charm, which actually allows for his character's flaws to be more apparent. Woody's a flawed guy, and we know it, but we love him anyway, because he's really funny and really witty and really intelligent. Branagh's Allen is a bit more flat-out bad, but with the same charm so that, yes, we like him, but we're still not sure if he's really a good person or not.

His character, Lee Simon, is first seen on the set of the aforementioned movie, hits on extra actress Winona Ryder, then goes off to interview Griffith, who takes him to her childhood home where he makes a pass at her, and she denies him...sorta. We then learn, through flashbacks, that Lee has been sucked into trying to be a celebrity thanks to a mid-life crisis and an appearance at his high school reunion. He has since quit his job as a travel journalist and become a gossip journalist of sorts, covering movie sets and places where celebrities congregate, so that he can meet them, and maybe sell his script (a bank robbery movie "but with a deep personal crisis"). As such, he has divorced his wife of several years (Allen regular Judy Davis), and continues on a quest for sexual happiness, boucing from girlfriend to girlfriend and fling to fling over the course of the film.

After Griffith comes his escapades with a model (Charlize Theron) who is "polymorphously perverse" (glad to see Allen is using new jokes, ha ha), who takes him for a wild ride not different from that of the Anita Ekberg segment of "La Dolce Vita." Following are his safe relationship with smart working woman Famke Janssen, a relationship that almost assures him success, and his continued escapades with Ryder, whom he fancies most of all. His story is juxtaposed with that of Davis, who flips out, but stumbles onto happiness when she runs into a handsome, friendly TV exec (Joe Mantegna) who lands her a job that furthers her career to national status. While Lee is fumbling about, selfishly trying to ensure his own happiness, Davis becomes happy ("I've become the kind of woman I've always hated...and I'm loving it.") without doing a thing.

The result is a film of highs and mediums. The mediums are what take up most of the film, with sitations and scenes which don't exactly work but you can't help but pat Allen on the back for trying. But other places are really great scenes. The opening. The sequence with Theron, which is so good that I wished it hadn't ended. A banana scene with Bebe Neuwirth (droll as ever). And, perhaps the best sequence: a romp with hot-as-hell teen idol, Brandon Darrow, played by none other than Leo DiCaprio, who is so un-DiCaprio-esque that if any of this fans could sit through this film, they'd never look at him the same way. He ignites the screen with intensity, and spares nothing in showing his character as narcissistically tyrannical, and totally heartbreaking for Lee, who comes to him to talk about his script that he has read, and finds himself on a wild all-day ride with him. They go to Atlantic City to watch a fight, they gamble, and they wind up in his hotel room, where Darrow gets it on with his flame (Gretchen Mol) and he lends him one of the leftover groupies. Allen's writing in these scenes are so good that just for them, I'd almost recommend the film. Almost.

But what I really liked about this film is despite the fact that it's a mess, despite the fact that what this film really needs is a good old fashioned rewrite by Allen himself, it's still a smart and insightful film. Though some of the jokes are either stale or misplaced (some seem too cartoonish, even for this environment), Allen still manages to get across that this film is not exactly about celebrities, as it may seem to be (if it were, it'd be extremely out-of-touch), but about those who want to be celebrities, and how they equate celebrity-hood with happiness. We never get close enough to the actual celebrities to see if they're really happy (they may appear to be on the surface...), but we do get close enough to Lee and Davis' character. Lee is obsessed with the phenomenon, while Davis takes is at arm's length, and never gets too involved in what it is, and soon becomes one herself.

Besides, it's witty, and it does have the one thing that no other film has but Allen's: that great Woody Allen feel. It may be not exactly fresh and lively or totally brilliant in its depiction of its subject, and yes, as a part of Woody Allen's oeuvre, it's merely a blip (no "Annie Hall" but it's no "Shadows and Fog" either), but it goes to prove that no one can make a film like him, and only he and maybe Godard could possibly take a totally horrible metaphor, like the one in the beginning, and make it work not once but twice.

MY RATING (out of 4): ***

Homepage at:


Interestingly, 15970 is one of the “corrections” moved from pos to neg in diff.txt

diff_txt = sentiment_fh.extractfile(sentiment_fh.getmember('diff.txt')).read().decode(CODEC).rstrip()

== Changes made == 
-> mix20_rand700_tokens_0211.tar.gz

Removed : (non-English/incomplete reviews)



Moved: (based on Nathan's judgement when he read the review,
sometimes different from the original author's own rating,
as listed below)

neg -> pos:
cv279_tok-23947.txt *1/2, but reads positive
cv346_tok-24609.txt     misclassification
cv375_tok-0514.txt      misclassification
cv389_tok-8969.txt  misclassification
cv425_tok-8417.txt  several reviews together
cv518_tok-11610.txt     misclassification

pos -> neg:
cv017_tok-29801.txt     *** Average, hits and misses 
cv352_tok-15970.txt     (out of 4): *** 
cv375_tok-21437.txt     * * * - Okay movie, hits and misses
cv377_tok-7572.txt  *** Pretty good, bring a friend
cv546_tok-23965.txt     * * * - Okay movie, hits and misses

A lot of the other badly misclassified examples were driven by repeated punctuation (exclamation marks and question marks)

Counter(word for word in data['text'][very_wrong_idx[1]].split() if word in positive3 + negative3)
Counter({'bad': 1,
         'stupid': 1,
         'worst': 2,
         '!': 9,
         'great': 2,
         '?': 6,
         'still': 2})
Counter(word for word in data['text'][very_wrong_idx[2]].split() if word in positive3 + negative3)
Counter({'!': 3, '?': 6, 'bad': 1, 'great': 1})
Counter(word for word in data['text'][very_wrong_idx[3]].split() if word in positive3 + negative3)
Counter({'?': 9, 'bad': 1})
Counter(word for word in data['text'][very_wrong_idx[4]].split() if word in positive3 + negative3)
Counter({'?': 11, 'love': 2, 'bad': 1})

Machine Learning Methods

We’ll now use traditional machine learning methods. To demonstrate the different methods we’ll use a small vocabulary from the human baseline.

To keep this section a reasonable length we’ll use the sklearn implementations of the methods.

vocab = positive3 + negative3

We’ll create a feature vector of word counts from this vocabulary for each document

word_counts = [Counter(word for word in doc if word in vocab) for doc in X]

X_feature = [[row[word] for word in vocab] for row in word_counts]

dict(zip(vocab, X_feature[0]))
{'love': 0,
 'wonderful': 0,
 'best': 0,
 'great': 1,
 'superb': 0,
 'still': 1,
 'beautiful': 0,
 'bad': 0,
 'worst': 0,
 'stupid': 0,
 'waste': 0,
 'boring': 0,
 '?': 1,
 '!': 0}

And split it into train and test sets by the cvid

X_train = [row for row, cvid in zip(X_feature, data['cvid']) if int(cvid) // 233 < 2]
X_test = [row for row, cvid in zip(X_feature, data['cvid']) if int(cvid) // 233 >= 2]

y_train = [row for row, cvid in zip(y, data['cvid']) if int(cvid) // 233 < 2]
y_test = [row for row, cvid in zip(y, data['cvid']) if int(cvid) // 233 >= 2]

len(X_train), len(X_test), len(y_train), len(y_test)
(921, 465, 921, 465)

Naive Bayes

The text states they use Naive Bayes with add-1 smoothing (so in sklearn alpha=1.0):

from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=1.0).fit(X_train, y_train)

accuracy(nb.predict(X_test), y_test)
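To make the add-1 smoothing concrete, here is a minimal pure-Python sketch of multinomial Naive Bayes over count vectors. It is not sklearn's implementation; the function names and the toy data are made up for illustration.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count vectors."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [row for row, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each feature in class c, plus alpha (add-1 when alpha=1)
        counts = [sum(row[j] for row in rows) + alpha for j in range(n_features)]
        total = sum(counts)
        log_likelihood[c] = [math.log(count / total) for count in counts]
    return log_prior, log_likelihood

def predict_multinomial_nb(model, row):
    """Pick the class with the highest log prior plus count-weighted log likelihood."""
    log_prior, log_likelihood = model
    scores = {c: log_prior[c] + sum(n * ll for n, ll in zip(row, log_likelihood[c]))
              for c in log_prior}
    return max(scores, key=scores.get)

# Toy data (hypothetical): columns are counts of ('great', 'bad')
toy_X = [[2, 0], [1, 0], [0, 3], [0, 1]]
toy_y = [1, 1, 0, 0]
toy_model = fit_multinomial_nb(toy_X, toy_y)
toy_prediction = predict_multinomial_nb(toy_model, [1, 0])
```

Without smoothing, a word never seen in one class would get probability zero and veto that class outright; adding alpha to every count avoids this.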

There is another variant of Naive Bayes, the Bernoulli model, where each vocabulary word is treated as a binary present/absent feature, but it tends to be worse for NLP.

from sklearn.naive_bayes import BernoulliNB

nb = BernoulliNB(binarize=1.0, alpha=1.0).fit(X_train, y_train)
accuracy(nb.predict(X_test), y_test)
nb = MultinomialNB(alpha=1.0)
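One way to see how the Bernoulli variant differs: it scores every vocabulary word, so a word that is absent also contributes to the likelihood. A minimal sketch, with hypothetical probabilities and a made-up helper name:

```python
import math

def bernoulli_log_likelihood(presence, probs):
    """Log-likelihood of a binary presence vector given P(word present | class)."""
    return sum(math.log(p) if present else math.log(1 - p)
               for present, p in zip(presence, probs))

# Hypothetical probabilities that ('great', 'bad') appear in a positive review
positive_probs = [0.75, 0.25]
with_bad = bernoulli_log_likelihood([1, 1], positive_probs)
without_bad = bernoulli_log_likelihood([1, 0], positive_probs)
```

Here the review that does not contain 'bad' scores higher under the positive class, purely because of the absence term; the multinomial model would ignore a word with a zero count.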

Maximum Entropy

Maximum Entropy is an old NLP term for Logistic Regression

We use ten iterations of the improved iterative scaling algorithm (Della Pietra et al., 1997) for parameter training (this was a sufficient number of iterations for convergence of training-data accuracy), together with a Gaussian prior to prevent overfitting (Chen and Rosenfeld, 2000).

A Gaussian Prior is equivalent to an L2 penalty, but they don’t specify the size of the prior and I can’t access the referenced paper. I’ll stick to the default in sklearn of 1.0 (the solver shouldn’t matter much).
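To spell out the equivalence: with a zero-mean Gaussian prior of variance σ² on each weight, maximising the posterior is the same as minimising an L2-penalised log loss. The notation below (w for the weights, σ² for the prior variance, ℓ_i for the per-example log loss) is not the paper's, and the link to sklearn's C is read off sklearn's documented objective, ignoring details such as how the intercept is treated.

```latex
\begin{aligned}
\hat{w} &= \arg\max_w \Big[ \sum_i \log p(y_i \mid x_i, w) + \log p(w) \Big],
  \qquad p(w) \propto \exp\!\Big(-\frac{\lVert w \rVert_2^2}{2\sigma^2}\Big) \\
        &= \arg\min_w \Big[ \sum_i \ell_i(w) + \frac{1}{2\sigma^2} \lVert w \rVert_2^2 \Big],
  \qquad \ell_i(w) = -\log p(y_i \mid x_i, w) \\
        &= \arg\min_w \Big[ \sigma^2 \sum_i \ell_i(w) + \frac{1}{2} \lVert w \rVert_2^2 \Big]
\end{aligned}
```

The last line has the same shape as an objective of the form C Σ ℓ_i + ½‖w‖², so under these assumptions C plays the role of σ², and C=1.0 would correspond to a unit-variance prior.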

from sklearn.linear_model import LogisticRegression

me = LogisticRegression(penalty='l2', solver='liblinear', C=1.0).fit(X_train, y_train)
accuracy(me.predict(X_test), y_test)

Note that the amount of regularization can actually matter. Ideally we’d keep a small holdout set for hyperparameter tuning, but I’ll stick to the methods in the original.

me = LogisticRegression(penalty='l2', solver='liblinear', C=0.5).fit(X_train, y_train)
accuracy(me.predict(X_test), y_test)
me = LogisticRegression(penalty='l2', solver='liblinear', C=1.0)

Support Vector Machines

We used Joachim’s (1999) SVM light package for training and testing, with all parameters set to their default values, after first length-normalizing the document vectors, as is standard (neglecting to normalize generally hurt performance slightly).

There’s little detail as to the hyperparameters again, so I’ll use the default.

from sklearn.svm import LinearSVC
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline

svm = Pipeline([('norm', Normalizer(norm='l2')),
                ('svc', LinearSVC(penalty='l2', loss='squared_hinge', C=1.0))]).fit(X_train, y_train)
accuracy(svm.predict(X_test), y_test)

I’m not clear on what length-normalizing means here, but L2 normalization looks like it works better than L1:
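For reference, the two candidate meanings of length normalization can be sketched per document vector as follows. This is a pure-Python illustration with made-up helper names, not sklearn's Normalizer code; leaving all-zero rows unchanged is a choice made here to avoid dividing by zero.

```python
import math

def l1_normalize(row):
    """Scale the vector so its absolute values sum to 1."""
    total = sum(abs(v) for v in row)
    return [v / total for v in row] if total else list(row)

def l2_normalize(row):
    """Scale the vector so its Euclidean length is 1."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else list(row)

example = [3, 4, 0]
l1_example = l1_normalize(example)
l2_example = l2_normalize(example)
```

Either way, a long review and a short review with the same word proportions end up with the same feature vector, which is the point of normalizing before a margin-based classifier.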

svm = Pipeline([('norm', Normalizer(norm='l1')),
                ('svc', LinearSVC(penalty='l2', loss='squared_hinge', C=1.0))]).fit(X_train, y_train)
accuracy(svm.predict(X_test), y_test)

And this works better than not normalizing it at all

svm = LinearSVC(penalty='l2', loss='squared_hinge', C=1.0).fit(X_train, y_train)
accuracy(svm.predict(X_test), y_test)
/home/eross/mambaforge/envs/pang_lee_2003/lib/python3.8/site-packages/sklearn/svm/ ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.

As with Maximum Entropy, it’s sensitive to the amount of regularization.

svm = Pipeline([('norm', Normalizer(norm='l2')),
                ('svc', LinearSVC(penalty='l2', loss='squared_hinge', C=2.0))]).fit(X_train, y_train)
accuracy(svm.predict(X_test), y_test)

We’ll reset everything to the defaults

svm = Pipeline([('norm', Normalizer(norm='l2')),
                ('svc', LinearSVC(penalty='l2', loss='squared_hinge', C=1.0))])


Experimental Set-up

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion

from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

we randomly selected 700 positive-sentiment and 700 negative-sentiment documents

Counter({0: 692, 1: 694})

We then divided this data into three equal-sized folds, maintaining balanced class distributions in each fold.

folds = [[idx for idx, cvid in enumerate(data['cvid']) if int(cvid) // 233 == i] for i in range(3)]

[len(f) for f in folds]
[459, 462, 463]
cv = [(folds[0] + folds[1], folds[2]),
      (folds[0] + folds[2], folds[1]),
      (folds[1] + folds[2], folds[0])]
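The fold sizes above add up to 1384, while the earlier train/test split covered 1386 documents. One possible explanation, which is an assumption and has not been checked against the data, is that the ids run from 0 to 699, so that 699 // 233 gives 3 and falls outside range(3). A sketch of a clamped fold assignment (the helper name fold_of is made up):

```python
def fold_of(cvid, fold_size=233, n_folds=3):
    # Clamp so any leftover ids (e.g. 699) land in the last fold
    return min(int(cvid) // fold_size, n_folds - 1)

# Assuming ids 0..699, count how many ids fall in each fold
fold_sizes = [sum(1 for cvid in range(700) if fold_of(cvid) == i) for i in range(3)]
unclamped_699 = 699 // 233
```

With clamping, every id is assigned to one of the three folds, at the cost of the last fold being one id larger per class.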

One unconventional step we took was to attempt to model the potentially important contextual effect of negation: clearly “good” and “not very good” indicate opposite sentiment orientations. Adapting a technique of Das and Chen (2001), we added the tag NOT to every word between a negation word (“not”, “isn’t”, “didn’t”, etc.) and the first punctuation mark following the negation word.

To do this we can first get the negation words with a surprisingly effective heuristic:

words = Counter(word for text in X for word in text)

negation_words = [w for (w,c) in words.most_common(10_000) if re.match(".*n[o']t$", w)]
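To get a feel for what this pattern does and does not catch, here is a small check on a hand-picked word list (the list is made up for illustration, not taken from the corpus):

```python
import re

NEGATION_PATTERN = re.compile(r".*n[o']t$")

# Hand-picked candidates: real negations, near misses, and a false positive
candidates = ["not", "isn't", "didn't", "cannot", "note", "nothing", "knot", "never"]
matches = [w for w in candidates if NEGATION_PATTERN.match(w)]
```

The heuristic picks up "not", the n't contractions and "cannot", but it also matches any other word ending in "not" (such as "knot") and misses negations like "never", so the resulting list is worth eyeballing.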

Then we can use a simple finite state machine to add the negation tokens.

punctuation = '!?.,()[];:,"'

def negation_mark(x):
    return 'NOT_' + x

def add_negation_tag(tokens, negation_words=negation_words, punctuation=punctuation, negation_mark=negation_mark):
    in_negation = False
    tagged_tokens = []
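    for token in tokens:
        if token in negation_words:
            # A negation word opens a negated span (or closes one that is already open)
            in_negation = not in_negation
        elif token in punctuation:
            # Punctuation ends the negated span
            in_negation = False
        elif in_negation:
            token = negation_mark(token)
        tagged_tokens.append(token)
    return tagged_tokens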

def text_add_negation_tag(s: str, **kwargs) -> str:
    return ' '.join(add_negation_tag(tokens=s.split(), **kwargs))

text_add_negation_tag("this isn't a great movie , it is terrible")
"this isn't NOT_a NOT_great NOT_movie , it is terrible"

For this study, we focused on features based on unigrams (with negation tagging) and bigrams. Because training MaxEnt is expensive in the number of features, we limited consideration to (1) the 16165 unigrams appearing at least four times in our 1400-document corpus (lower count cutoffs did not yield significantly different results), and (2) the 16165 bigrams occurring most often in the same data (the selected bigrams all occurred at least seven times). Note that we did not add negation tags to the bigrams, since we consider bigrams (and n-grams in general) to be an orthogonal way to incorporate context.

There’s a slight issue here in using a vocabulary based on the entire dataset: the features should be selected using only the training data in each split, otherwise you could be overfitting to the test fold.
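A minimal sketch of selecting the vocabulary from the training documents only, using a count cutoff like the paper's. The helper name and toy documents are made up; the cutoff of 3 is chosen just to suit the toy data.

```python
from collections import Counter

def select_vocabulary(train_docs, min_count=4):
    """Keep tokens that occur at least min_count times in the training documents."""
    counts = Counter(token for doc in train_docs for token in doc.split())
    return sorted(token for token, count in counts.items() if count >= min_count)

toy_docs = ["good good film", "good bad film", "bad bad bad plot", "good film"]
train_docs, test_docs = toy_docs[:3], toy_docs[3:]

# The held-out document plays no part in choosing the features
vocabulary = select_vocabulary(train_docs, min_count=3)
```

An sklearn Pipeline that includes the vectorizer behaves this way under cross_val_score, since the whole pipeline is refit on each training split.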

We’ll be making lots of these, so we’ll write a factory function to remove some of the boilerplate.

def make_count_vectorizer(
    token_pattern=r"[^ ]+",
    return CountVectorizer(input=input,

unigram_freq_vectorizer = make_count_vectorizer(preprocessor=text_add_negation_tag, binary=False)

unigram_vectorizer = make_count_vectorizer(preprocessor=text_add_negation_tag)

bigram_vectorizer = make_count_vectorizer(ngram_range=(2,2))
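As a rough illustration of the options these vectorizers are built with (preprocessing aside), here is a pure-Python sketch of space-delimited tokenization, n-gram ranges, and presence versus frequency counts. It is not CountVectorizer itself: there is no lowercasing or feature limit, and the default of binary=True is an assumption inferred from the calls above.

```python
import re
from collections import Counter

def count_features(doc, token_pattern=r"[^ ]+", ngram_range=(1, 1), binary=True):
    """Count n-grams in a document; with binary=True record presence only."""
    tokens = re.findall(token_pattern, doc)
    shortest, longest = ngram_range
    ngrams = [" ".join(tokens[i:i + n])
              for n in range(shortest, longest + 1)
              for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    if binary:
        counts = Counter({gram: 1 for gram in counts})
    return counts

doc = "a good , good film"
presence = count_features(doc)
frequency = count_features(doc, binary=False)
bigrams = count_features(doc, ngram_range=(2, 2))
```

Because the token pattern is "anything but a space", punctuation tokens such as "," survive as features, which matters for a corpus that was already tokenized with spaces.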


From the README there’s an updated Figure 3

   Features            # features  NB    ME    SVM
1  unigrams (freq.)    16162       79.0  n/a   73.0
2  unigrams            16162       81.0  80.2  82.9
3  unigrams+bigrams    32324       80.7  80.7  82.8
4  bigrams             16162       77.3  77.5  76.5
5  unigrams+POS        16688       81.3  80.3  82.0
6  adjectives          2631        76.6  77.6  75.3
7  top 2631 unigrams   2631        80.9  81.3  81.2
8  unigrams+position   22407       80.8  79.8  81.8

For each feature function (row in the table) we want to apply it followed by each of the 3 types of classifiers

All results reported below, as well as the baseline results from Section 4, are the average three-fold cross-validation results on this data.

We can use cross_val_score to do the cross-validation.

models = {
    'NB': nb,
    'ME': me,
    'SVM': svm
}

results = {}

def evaluate(vectorizer):
    results = {}
    for model_name, model in models.items():
        pipeline = Pipeline(steps=[('Vectorizer', vectorizer), ('model', model)])
        cv_score = cross_val_score(pipeline, data['text'], y, cv=cv, scoring='accuracy')
        results[model_name] = cv_score.mean()
    return results
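What cross_val_score does with an explicit list of (train, test) index pairs can be sketched in a few lines: fit on the training indices, score on the test indices, and average. The majority-class "model" and toy labels below are stand-ins invented for the example.

```python
from collections import Counter

def majority_fit(labels):
    """A trivial stand-in model: always predict the most common training label."""
    return Counter(labels).most_common(1)[0][0]

def cross_validate(labels, splits):
    """Average accuracy over explicit (train indices, test indices) pairs."""
    scores = []
    for train_idx, test_idx in splits:
        prediction = majority_fit([labels[i] for i in train_idx])
        correct = sum(labels[i] == prediction for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

toy_labels = [1, 1, 0, 1, 0, 1, 1, 1, 0]
toy_folds = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
toy_splits = [(toy_folds[0] + toy_folds[1], toy_folds[2]),
              (toy_folds[0] + toy_folds[2], toy_folds[1]),
              (toy_folds[1] + toy_folds[2], toy_folds[0])]
toy_score = cross_validate(toy_labels, toy_splits)
```

Each document is used for testing exactly once, and the reported number is the mean of the three held-out accuracies.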

Initial Unigram Results

We can show the results to compare with the table

def display_results(results):
    print('\t'.join([f'{v:0.1%}' for v in results.values()]))
def display_results(results):
    result_html = '<table><tr>' + ''.join([f'<th>{r}</th>' for r in results]) + '</tr><tr>' + \
                  ''.join([f'<td>{v:0.1%}</td>' for v in results.values()]) + '</tr></table>'

This does a little better than the table from the README, which is surprising, especially for Naive Bayes, which is deterministic with no tunable parameters.

results['unigram_freq'] = evaluate(unigram_freq_vectorizer)

80.3% 80.5% 77.0%

Feature frequency vs. presence

As in the paper using presence features gives a significant lift, but again all the accuracies are much higher here. In particular Maximum Entropy (Logistic Regression) is higher than Naive Bayes here and much closer to SVM, which suggests there could be poor regularisation in the original.

results['unigram'] = evaluate(unigram_vectorizer)

81.9% 84.2% 84.4%


Line (3) of the results table shows that bigram information does not improve performance beyond that of unigram presence, although adding in the bigrams does not seriously impact the results, even for Naive Bayes

This is still observed here, but again our results are slightly higher than the authors’.

results['unigram_bigram'] = evaluate(FeatureUnion([('unigram', unigram_vectorizer), ('bigram', bigram_vectorizer)]))

80.7% 83.8% 84.4%

We have similar observations for bigrams:

However, comparing line (4) to line (2) shows that relying just on bigrams causes accuracy to decline by as much as 5.8 percentage points.

results['bigram'] = evaluate(bigram_vectorizer)

77.6% 78.0% 78.3%

We see a marginally larger decline

{k: f"{results['bigram'][k] - results['unigram'][k] :0.2%}" for k in results['bigram']}
{'NB': '-4.26%', 'ME': '-6.14%', 'SVM': '-6.07%'}

Parts of speech

We also experimented with appending POS tags to every word via Oliver Mason’s Qtag program

Unfortunately I can’t access QTag, but instead will use a much more modern (and likely more accurate) Averaged Perceptron Tagger from NLTK. It’s relatively expensive so I’ll cache the calculations across models.

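import nltk
from functools import lru_cache

nltk.download('averaged_perceptron_tagger')

# Cache the tags so each document is only tagged once across models
@lru_cache(maxsize=None)
def pos_tag(doc):
    return nltk.pos_tag(doc.split())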

pos_tag('The quick brown fox jumped over the lazy dog.')
[('The', 'DT'),
 ('quick', 'JJ'),
 ('brown', 'NN'),
 ('fox', 'NN'),
 ('jumped', 'VBD'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('lazy', 'JJ'),
 ('dog.', 'NN')]

We can then append the tags as follows:

def pos_marker(doc):
    return ' '.join([token + '_' + tag for token, tag in pos_tag(doc)])


However, the effect of this information seems to be a wash: as depicted in line (5) of Figure 3, the accuracy improves slightly for Naive Bayes but declines for SVMs, and the performance of MaxEnt is unchanged.

We actually get worse results across the board

unigram_pos_vectorizer = make_count_vectorizer(preprocessor=pos_marker)

results['unigram_pos'] = evaluate(unigram_pos_vectorizer)

80.6% 82.4% 82.5%
81.9% 84.2% 84.4%

It’s not the effect of dropping negations either (in fact that makes Naive Bayes and SVM do slightly better).

unigram_no_negation_vectorizer = make_count_vectorizer(preprocessor=None)

82.2% 82.9% 83.3%

Since adjectives have been a focus of previous work in sentiment detection (Hatzivassiloglou and Wiebe, 2000; Turney, 2002), we looked at the performance of using adjectives alone.

def adjective_extractor(doc):
    adjectives = [token for token, tag in pos_tag(doc) if tag == 'JJ']
    return ' '.join(adjectives)

"united wide mpaa male full-frontal female graphic sexual frequent theatrical daphne murray liber kimball wild dreary early frontal late early lucrative mtv fast-paced slick flashy mindless wild first eleven convincing much flesh narrative wild real increasingly- improbable predictable serpentine idiotic steven seagal easy unlikely right wrong good next hot young old screen i'll film's erotic impressive wild risqué soft-core generic much lesbian token iron-clad film's full kevin few fully-clothed thirteenth titanic familiar wild john last finely-tuned psychological normal copious real powerful difficult same i mainstream previous wide-release mad box-office quick pretty main wild occasional sam matt florida's blue high curvaceous van theresa daphne skeptical suzie neve similar wild isn't good much ludicrous triple-digit on-screen nice slutty see-through one-piece kevin only interesting right secret wild absurd basic kinetic wild great superficial trash"

Yet, the results, shown in line (6) of Figure 3, are relatively poor: the 2633 adjectives provide less useful information than unigram presence.

adjective_vectorizer = make_count_vectorizer(preprocessor=lambda x: adjective_extractor(text_add_negation_tag(x)),

results['adjective'] = evaluate(adjective_vectorizer)

77.2% 76.2% 76.2%

Indeed, line (7) shows that simply using the 2633 most frequent unigrams is a better choice, yielding performance comparable to that of using (the presence of) all 16165 (line (2)).

unigram_2633_vectorizer = make_count_vectorizer(preprocessor=text_add_negation_tag, max_features=2633)

results['unigram_2633'] = evaluate(unigram_2633_vectorizer)

81.0% 81.9% 83.2%


As a rough approximation to determining this kind of structure, we tagged each word according to whether it appeared in the first quarter, last quarter, or middle half of the document.

def quartile_marker(doc):
    tokens = doc.split()
    quartile = len(tokens) // 4
    output_tokens = [ token + f'_{i // quartile}' for i, token in enumerate(tokens)]
    return ' '.join(output_tokens)

quartile_marker('this is an example text with several words in it')
'this_0 is_0 an_1 example_1 text_2 with_2 several_3 words_3 in_4 it_4'
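Note that with ten tokens the marker above produces five position tags (0 to 4) rather than three, and a document shorter than four tokens would divide by zero. A sketch that follows the paper's description more literally, tagging first quarter, middle half, and last quarter (the function name and tag names are made up):

```python
def position_marker(doc):
    """Tag each token as being in the first quarter, middle half, or last quarter."""
    tokens = doc.split()
    n = len(tokens)
    tagged = []
    for i, token in enumerate(tokens):
        if i * 4 < n:
            position = 'first'
        elif i * 4 >= 3 * n:
            position = 'last'
        else:
            position = 'middle'
        tagged.append(f'{token}_{position}')
    return ' '.join(tagged)

position_example = position_marker('this is an example text with several words in it')
```

Comparing i * 4 against n and 3 * n keeps everything in integers, so there is no division and short documents are handled without a special case.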

I suspect they kept the top vocabulary across the whole document, but we will generate it by quartile.

The results (line (8)) didn’t differ greatly from using unigrams alone, but more refined notions of position might be more successful.

position_vectorizer =  make_count_vectorizer(preprocessor=quartile_marker, max_features=22430)

results['position'] = evaluate(position_vectorizer)

78.7% 79.8% 80.1%

Let’s see all our results

import pandas as pd

                NB     ME     SVM
unigram_freq    80.3%  80.5%  77.0%
unigram         81.9%  84.2%  84.4%
unigram_bigram  80.7%  83.8%  84.4%
bigram          77.6%  78.0%  78.3%
unigram_pos     80.6%  82.4%  82.5%
adjective       77.2%  76.2%  76.2%
unigram_2633    81.0%  81.9%  83.2%
position        78.7%  79.8%  80.1%

Which is comparable to the original table

   Features            # features  NB    ME    SVM
1  unigrams (freq.)    16162       79.0  n/a   73.0
2  unigrams            16162       81.0  80.2  82.9
3  unigrams+bigrams    32324       80.7  80.7  82.8
4  bigrams             16162       77.3  77.5  76.5
5  unigrams+POS        16688       81.3  80.3  82.0
6  adjectives          2631        76.6  77.6  75.3
7  top 2631 unigrams   2631        80.9  81.3  81.2
8  unigrams+position   22407       80.8  79.8  81.8


The conclusions from the original still hold up:

The results produced via machine learning techniques are quite good in comparison to the human-generated baselines discussed in Section 4. In terms of relative performance, Naive Bayes tends to do the worst and SVMs tend to do the best, although the differences aren’t very large. On the other hand, we were not able to achieve accuracies on the sentiment classification problem comparable to those reported for standard topic-based categorization, despite the several different types of features we tried. Unigram presence information turned out to be the most effective; in fact, none of the alternative features we employed provided consistently better performance once unigram presence was incorporated.

They also give this analysis:

As it turns out, a common phenomenon in the documents was a kind of “thwarted expectations” narrative, where the author sets up a deliberate contrast to earlier discussion.

We now have much more advanced methods, in particular neural methods that can take the context into account.

On compute

This paper was relatively easy to reproduce because computers are so much faster than when they wrote it, and software is so much better.

It’s worth keeping in mind the change in technology in the last 20 years; the fastest supercomputer in the world at the time, the Earth Simulator, could perform just under 36 TFLOPS, about the same as 4 NVIDIA A100s that could be rented today in the cloud for under $5 an hour (AWS was just starting up around this time).

For a more grounded comparison, in 2003 Apple released the Power Mac G5, which at the time was a powerful consumer computer, but some benchmarks from 2017 show it’s around 10-100 times slower than a 7th Generation Intel i7. It had 256MB of RAM, where a mid-range laptop today would have at least 8GB, about 30 times more. This means recalculating the features 9 times (once per split and per model) is a reasonable thing to do now, but would have been very unreasonable at the time.

Software has also come a long way: Scikit-Learn started in 2007 and made all of the feature fitting trivial, which at the time would have involved plugging together different systems (likely in C and Java).

It’s interesting to think what things may be like in another 20 years.