from urllib.request import urlretrieve
from pathlib import Path
from zipfile import ZipFile
import tarfile
import re
data_dir = Path('data')
data_dir.mkdir(exist_ok=True)

source_html_url = 'http://www.cs.cornell.edu/people/pabo/movie-review-data/polarity_html.zip'
raw_html_path = data_dir / 'polarity_html.zip'

if not raw_html_path.exists():
    urlretrieve(source_html_url, raw_html_path)

raw_html_zip = ZipFile(raw_html_path)
Thumbs Up? Sentiment Classification Like it’s 2002
Introduction
In July 2002 Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan published Thumbs up? Sentiment Classification using Machine Learning Techniques at EMNLP, one of the earliest works using machine learning for sentiment classification. It was an influential paper, winning a test of time award at NAACL 2018, and at the time of writing has over 11,000 citations. This work led to their follow-up Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, and this dataset was the basis for the Stanford Sentiment Treebank dataset released in Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank by Socher et al., which is widely used partly because of its inclusion in GLUE.
The paper aims to show that classifying the sentiment of movie reviews is a more challenging problem for developing machine learning techniques than the existing topic classification problems, and to motivate further work (in which they were successful!). They do this by building a self-labelled dataset of polar movie reviews from Usenet and then showing that baseline classifiers don't work as well as they do on existing topic classification datasets.
This notebook aims to explore the paper and its methods in more detail, and the headings follow the paper section by section. We go much deeper into the data than the paper, reproduce their methods, and get similar (but slightly better) results. A good future work would be to apply more modern methods to this dataset.
The Movie Review Domain
They took reviews from the Internet Movie Database (IMDb) archive of the rec.arts.movies.reviews newsgroup, kept the reviews with a numerical or star rating, labelled the highest scored ones positive and the lowest negative, and removed the rest.
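As a rough sketch of that labelling rule (my own illustration with made-up thresholds, not the authors' actual cutoffs, which vary by rating scheme):

def label_review(rating, max_rating=4.0):
    # Hypothetical thresholds for illustration only; the authors' actual
    # cutoffs are described in their README and vary by rating scheme.
    fraction = rating / max_rating
    if fraction >= 0.75:
        return 'pos'
    if fraction <= 0.25:
        return 'neg'
    return None  # middling reviews are dropped

[label_review(r) for r in (3.5, 2.0, 0.5)]  # ['pos', None, 'neg']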
The IMDb archive no longer exists, but there are current archives of this newsgroup in Google Groups and the Usenet Archives. Thankfully the authors released their original data: both the raw HTML they collected and the extracted text they used for classification.
Let’s take a look at the HTML to see what they worked with
The zipfile contains a single directory movie containing around 27k review files
len(raw_html_zip.infolist())
27887
raw_html_zip.infolist()[:5]
[<ZipInfo filename='movie/0002.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=4415 compress_size=2170>,
<ZipInfo filename='movie/0003.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=2702 compress_size=1398>,
<ZipInfo filename='movie/0004.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=6165 compress_size=3059>,
<ZipInfo filename='movie/0005.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=4427 compress_size=2103>,
<ZipInfo filename='movie/0006.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=6423 compress_size=3225>]
raw_html_zip.infolist()[-5:]
[<ZipInfo filename='movie/9995.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=5232 compress_size=2643>,
<ZipInfo filename='movie/9997.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=10113 compress_size=4812>,
<ZipInfo filename='movie/9998.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=3868 compress_size=1935>,
<ZipInfo filename='movie/9999.html' compress_type=deflate filemode='-rw-rw-rw-' file_size=3081 compress_size=1605>,
<ZipInfo filename='movie/' filemode='drwxrwxrwx' external_attr=0x10>]
Let’s have a look at one of them (that’s not too long, and recent enough to be in other archives); you could also see it on the Usenet Archives or Google Groups.
Note that the original was almost certainly a plaintext email; some of the HTML markup (in particular the footer) would have been added by IMDb. Note that the rating is stated twice in the review as a "low 0" on a scale from -4 to 4; this begins to indicate the difficulty of automatically extracting the ratings.
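To get a feel for why, here's a hypothetical regex that catches the "low 0 on the -4 to +4 scale" phrasing used in the review printed below; reviewers also used star ratings, "7/10" scores, and letter grades, so real extraction would need many such patterns:

import re

# Hypothetical pattern for one phrasing only; not the authors' extraction code.
rating_pattern = re.compile(
    r"(?P<rating>[-+]?\d+(?:\.\d+)?)\s+on\s+the\s+(?P<low>[-+]?\d+)\s+to\s+(?P<high>[-+]?\d+)\s+scale",
    re.IGNORECASE)

match = rating_pattern.search("My rating is a low 0 on the -4 to +4 scale.")
print(match.groupdict())  # {'rating': '0', 'low': '-4', 'high': '+4'}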
CODEC = 'ISO-8859-1'
movie_review_html = raw_html_zip.read('movie/0908.html').decode(CODEC)
print(movie_review_html)
<HTML><HEAD>
<TITLE>Review for Flight of the Intruder (1990)</TITLE>
<LINK REL="STYLESHEET" TYPE="text/css" HREF="/ramr.css">
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<H1 ALIGN="CENTER" CLASS="title"><A HREF="/Title?0099587">Flight of the Intruder (1990)</A></H1><H3 ALIGN=CENTER>reviewed by<BR><A HREF="/ReviewsBy?Mark+R.+Leeper">Mark R. Leeper</A></H3><HR WIDTH="40%" SIZE="4">
<PRE> FLIGHT OF THE INTRUDER
A film review by Mark R. Leeper
Copyright 1991 Mark R. Leeper</PRE>
<P> Capsule review: Pretty pictures, stupid story. The
air-war of a previous conflict is occasionally entertaining
to watch but the plot is cliched as are most of the
characters. This film's only chance is to follow the current
wave of interest in military equipment. Rating: low 0.</P>
<P> Had I not actually seen a copy of the book FLIGHT OF THE INTRUDER by
Stephen Coonts, I would have had a hard time telling if this was a very weak
story given classy military equipment photography and quality special
effects treatment or if this was just a collection of classy military
equipment photography and quality special effects tied together by a very
weak excuse for a story. During World War II a lot of B war movies carried
stories just as good to the bottom half of double bills. We are talking
HELLCATS OF THE NAVY-level plotting here. In 1972 Vietnam we have an
aircraft carrier ruled over by a cigar-chewing, mean-as-a-junkyard-dog-but-
heart-of-gold sort of commander. Danny Glover plays the Black commander
with the unlikely name Frank Camparelli. One of his bright young pilots,
Jake Grafton (played by the uninteresting Brad Johnson) agonizes over the
loss of his bombardier. The companion is lost in a raid that accomplishes
nothing besides adding visual interest to the opening credits. Grafton
wants to go on a super-special raid of his own devising. But this raid is
directly contrary to orders. His top-gun replacement bombardier Virgil Cole
(played by Willem Dafoe) says absolutely not. Does Jake get to make his
super-special raid on North Vietnam? And if he does, what is the Navy's
reaction?</P>
<P> The weak story is, however, punctuated by pretty pictures of planes,
helicopters, and aircraft carriers to keep the audience watching. If this
film stands any chance with audiences it is in the fortuitous timing of this
film coincident with a sudden upsurge of interest in technical weaponry.
Indeed many people may find events in the Middle East resonating with
attitudes in this film. On the other hand, maybe some people would prefer
to stay home and watch technical weaponry on television.</P>
<P> FLIGHT OF THE INTRUDER is directed by John Milius, who is specializing
in gutsy films like APOCALYPSE NOW (which he wrote), CONAN THE BARBARIAN,
and RED DAWN. The score is by Basil Poledouris, the gifted composer of the
scores for the "Conan" films, who seems repeatedly associated with films
with right-wing themes. Poledouris scored RED DAWN, AMERIKA, and THE HUNT
FOR RED OCTOBER.</P>
<P> FLIGHT OF THE INTRUDER is linked in advertising with THE HUNT FOR RED
OCTOBER, but it falls well short of that film's interest value and quality.
My rating is a low 0 on the -4 to +4 scale.</P>
<PRE> Mark R. Leeper
att!mtgzy!leeper
<A HREF="mailto:leeper@mtgzy.att.com">leeper@mtgzy.att.com</A>
.
</PRE>
<HR><P CLASS=flush><SMALL>The review above was posted to the
<A HREF="news:rec.arts.movies.reviews">rec.arts.movies.reviews</A> newsgroup (<A HREF="news:de.rec.film.kritiken">de.rec.film.kritiken</A> for German reviews).<BR>
The Internet Movie Database accepts no responsibility for the contents of the
review and has no editorial control. Unless stated otherwise, the copyright
belongs to the author.<BR>
Please direct comments/criticisms of the review to relevant newsgroups.<BR>
Broken URLs inthe reviews are the responsibility of the author.<BR>
The formatting of the review is likely to differ from the original due
to ASCII to HTML conversion.
</SMALL></P>
<P ALIGN=CENTER>Related links: <A HREF="/Reviews/">index of all rec.arts.movies.reviews reviews</A></P>
</P></BODY></HTML>
For convenience let’s define a function to read the HTML of a given movie
def get_movie_html(movieid):
    return raw_html_zip.read(f'movie/{movieid}.html').decode(CODEC)

print(get_movie_html('0908')[:1000])
<HTML><HEAD>
<TITLE>Review for Flight of the Intruder (1990)</TITLE>
<LINK REL="STYLESHEET" TYPE="text/css" HREF="/ramr.css">
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<H1 ALIGN="CENTER" CLASS="title"><A HREF="/Title?0099587">Flight of the Intruder (1990)</A></H1><H3 ALIGN=CENTER>reviewed by<BR><A HREF="/ReviewsBy?Mark+R.+Leeper">Mark R. Leeper</A></H3><HR WIDTH="40%" SIZE="4">
<PRE> FLIGHT OF THE INTRUDER
A film review by Mark R. Leeper
Copyright 1991 Mark R. Leeper</PRE>
<P> Capsule review: Pretty pictures, stupid story. The
air-war of a previous conflict is occasionally entertaining
to watch but the plot is cliched as are most of the
characters. This film's only chance is to follow the current
wave of interest in military equipment. Rating: low 0.</P>
<P> Had I not actually seen a copy of the book FLIGHT OF THE INTRUDER by
Stephen Coonts, I would have h
Cleaned Text
For the cleaned and labelled text we'll get version 1.1, which according to the README has some corrections over the version used in the paper.
sentiment_url = 'http://www.cs.cornell.edu/people/pabo/movie-review-data/mix20_rand700_tokens_0211.tar.gz'
sentiment_path = data_dir / 'sentiment.tar.gz'

if not sentiment_path.exists():
    urlretrieve(sentiment_url, sentiment_path)

sentiment_fh = tarfile.open(sentiment_path)
There's:

- a diff.txt that says what changed between versions
- a README describing the dataset
- subfolders neg/ and pos/ containing negative and positive reviews
sentiment_fh.getnames()[:10]
['diff.txt',
'README',
'tokens',
'tokens/neg',
'tokens/neg/cv303_tok-11557.txt',
'tokens/neg/cv000_tok-9611.txt',
'tokens/neg/cv001_tok-19324.txt',
'tokens/neg/cv002_tok-3321.txt',
'tokens/neg/cv003_tok-13044.txt',
'tokens/neg/cv004_tok-25944.txt']
sentiment_fh.getnames()[-10:]
['tokens/pos/cv690_tok-23617.txt',
'tokens/pos/cv691_tok-11491.txt',
'tokens/pos/cv692_tok-24295.txt',
'tokens/pos/cv693_tok-16307.txt',
'tokens/pos/cv694_tok-18628.txt',
'tokens/pos/cv695_tok-12873.txt',
'tokens/pos/cv696_tok-10835.txt',
'tokens/pos/cv697_tok-29325.txt',
'tokens/pos/cv698_tok-27735.txt',
'tokens/pos/cv699_tok-10425.txt']
We can extract the label (pos or neg), the cross-validation id (cvid), and the movie id from the filename with a regular expression
pattern = re.compile(r'^tokens/(?P<label>[^/]+)/cv(?P<cvid>[0-9]+)_tok-(?P<movieid>[0-9]+).txt$')

pattern.match('tokens/neg/cv303_tok-11557.txt').groupdict()
{'label': 'neg', 'cvid': '303', 'movieid': '11557'}
Let’s extract all the data into aligned lists
data = {
    'label': [],
    'cvid': [],
    'movieid': [],
    'text': []
}

for member in sentiment_fh:
    match = pattern.match(member.name)
    if not match:
        print('Skipping %s' % member.name)
        continue
    for k, v in match.groupdict().items():
        data[k].append(v)
    data['text'].append(sentiment_fh.extractfile(member).read().decode(CODEC).rstrip())

{k: len(v) for k, v in data.items()}
Skipping diff.txt
Skipping README
Skipping tokens
Skipping tokens/neg
Skipping tokens/pos
{'label': 1386, 'cvid': 1386, 'movieid': 1386, 'text': 1386}
To avoid domination of the corpus by a small number of prolific reviewers, we imposed a limit of fewer than 20 reviews per author per sentiment category, yielding a corpus of 752 negative and 1301 positive reviews, with a total of 144 reviewers represented.
…
we randomly selected 700 positive-sentiment and 700 negative-sentiment documents
We get slightly fewer, likely due to the updates since the dataset was first released
from collections import Counter

Counter(data['label'])
Counter({'neg': 692, 'pos': 694})
Data Lengths
These reviews can be quite long, and the tokenization of punctuation is quite aggressive; how long are the reviews in tokens?
lengths = [len(text.split()) for text in data['text']]

def median(l):
    return sorted(l)[len(l)//2]

def mean(l):
    return sum(l)/len(l)
The median length of reviews is around 700 tokens
median(lengths), mean(lengths)
Note that negative reviews tend to be a little shorter than positive reviews.
neg_lengths = [x for x, l in zip(lengths, data['label']) if l == 'neg']
median(neg_lengths), mean(neg_lengths)
(681, 715.5664739884393)
pos_lengths = [x for x, l in zip(lengths, data['label']) if l == 'pos']
median(pos_lengths), mean(pos_lengths)
(735, 798.0345821325649)
We can use this to get a few percentage points above chance without even looking at the text
preds = ['pos' if l > 730 else 'neg' for l in lengths]

def accuracy(preds, actuals):
    if len(preds) != len(actuals):
        raise ValueError('Expected same length input')
    return mean([p==a for p,a in zip(preds, actuals)])

accuracy(preds, data['label'])
0.5440115440115441
A Closer Look at the Problem
From the paper they set a benchmark using manual lists of positive and negative words:
One might also suspect that there are certain words people tend to use to express strong sentiments, so that it might suffice to simply produce a list of such words by introspection and rely on them alone to classify the texts.
To test this latter hypothesis, we asked two graduate students in computer science to (independently) choose good indicator words for positive and negative sentiments in movie reviews.
Extracting these from Figure 1
positive1 = 'dazzling, brilliant, phenomenal, excellent, fantastic'.split(', ')
positive1
['dazzling', 'brilliant', 'phenomenal', 'excellent', 'fantastic']
negative1 = 'suck, terrible, awful, unwatchable, hideous'.split(', ')
negative1
['suck', 'terrible', 'awful', 'unwatchable', 'hideous']
positive2 = 'gripping, mesmerizing, riveting, spectacular, cool, ' \
    'awesome, thrilling, badass, excellent, moving, exciting'.split(', ')
positive2
['gripping',
'mesmerizing',
'riveting',
'spectacular',
'cool',
'awesome',
'thrilling',
'badass',
'excellent',
'moving',
'exciting']
negative2 = 'bad, cliched, sucks, boring, stupid, slow'.split(', ')
negative2
['bad', 'cliched', 'sucks', 'boring', 'stupid', 'slow']
We then converted their responses into simple decision procedures that essentially count the number of the proposed positive and negative words in a given document.
We can build a small classifier to do this. When there’s a tie we need to decide how to break it with a “default”.
From the paper
Note that the tie rates — percentage of documents where the two sentiments were rated equally likely — are quite high (we chose a tie breaking policy that maximized the accuracy of the baselines)
We need to look at the data for this.
idx2cat = ['neg', 'pos']
cat2idx = {'neg': 0, 'pos': 1}
class NotFittedException(Exception):
    pass

class MatchCountClassifier:
    def __init__(self, positive, negative):
        self.positive = positive
        self.negative = negative
        self.default = None
        self.ties = None

    def _score(self, tokens):
        """Return number of positive words minus number of negative words in tokens"""
        pos_count = len([t for t in tokens if t in self.positive])
        neg_count = len([t for t in tokens if t in self.negative])
        return pos_count - neg_count

    def fit(self, X, y):
        """Find the default (tie-break) label that maximises accuracy"""
        scores = [self._score(tokens) for tokens in X]
        self.ties = len([x for x in scores if x == 0]) / len(scores)
        # Compare breaking ties towards positive against breaking ties towards negative
        pred_pos_default = [1 if x >= 0 else 0 for x in scores]
        pred_neg_default = [1 if x > 0 else 0 for x in scores]

        if accuracy(pred_pos_default, y) >= accuracy(pred_neg_default, y):
            self.default = 1
        else:
            self.default = 0
        return self

    def predict(self, X):
        if self.default is None:
            raise NotFittedException()
        scores = [self._score(tokens) for tokens in X]

        return [1 if score > 0 else 0 if score < 0 else self.default for score in scores]
Let’s test our class
mcc_test = MatchCountClassifier(['happy'], ['sad'])

X_test = [['happy'], ['sad'], ['happy', 'sad']]
y_test_1 = [1, 0, 1]
y_test_2 = [1, 0, 0]

mcc_test.fit(X_test, y_test_1)
assert mcc_test.default == 1
assert mcc_test.predict([['sad'], []]) == [0, 1]
assert mcc_test.ties == 1/3

mcc_test.fit(X_test, y_test_2)
assert mcc_test.default == 0
assert mcc_test.predict([['sad'], []]) == [0, 0]
Human Baselines
Let’s get the tokens and labels
X = [text.split() for text in data['text']]
y = [cat2idx[l] for l in data['label']]
The ties and accuracy match table 1 for Human 1
mcc1 = MatchCountClassifier(positive1, negative1)
mcc1.fit(X, y)
print(f'''Human 1
Ties: {mcc1.ties:0.0%}
Accuracy: {accuracy(mcc1.predict(X), y):0.0%}''')
Human 1
Ties: 75%
Accuracy: 56%
For Human 2 we get an accuracy 1 percentage point higher than the paper, likely due to corrections to the dataset.
mcc2 = MatchCountClassifier(positive2, negative2)
mcc2.fit(X, y)
print(f'''Human 2
Ties: {mcc2.ties:0.0%}
Accuracy: {accuracy(mcc2.predict(X), y):0.0%}''')
Human 2
Ties: 39%
Accuracy: 65%
They also provide a third baseline in table 2 using statistics from the dataset
Based on a very preliminary examination of frequency counts in the entire corpus (including test data) plus introspection, we created a list of seven positive and seven negative words (including punctuation), shown in Figure 2.
positive3 = 'love, wonderful, best, great, superb, still, beautiful'.split(', ')
positive3
['love', 'wonderful', 'best', 'great', 'superb', 'still', 'beautiful']
negative3 = 'bad, worst, stupid, waste, boring, ?, !'.split(', ')
negative3
['bad', 'worst', 'stupid', 'waste', 'boring', '?', '!']
Again we get an accuracy 1 percentage point higher than the paper
mcc3 = MatchCountClassifier(positive3, negative3)
mcc3.fit(X, y)
print(f'''Human 3 + stats
Ties: {mcc3.ties:0.0%}
Accuracy: {accuracy(mcc3.predict(X), y):0.0%}''')
Human 3 + stats
Ties: 15%
Accuracy: 70%
Could we do better?
An obvious strategy would be to combine the lists; they are mostly disjoint.
However, that doesn't improve our accuracy over Human 3 at all
mcc_all = MatchCountClassifier(set(positive1 + positive2 + positive3),
                               set(negative1 + negative2 + negative3))
mcc_all.fit(X, y)
print(f'''Combined Humans
Ties: {mcc_all.ties:0.0%}
Accuracy: {accuracy(mcc_all.predict(X), y):0.0%}''')
Combined Humans
Ties: 13%
Accuracy: 70%
Another resource that was available at the time was the Harvard General Inquirer lexicon, which tags words with a positiv or negativ sentiment, among many other classifications.
I can't find an official source for the lexicon, but there's a version inside the pysentiment library (which may be different from what was available at the time).
import csv

harvard_inquirer_url = 'https://raw.githubusercontent.com/nickderobertis/pysentiment/master/pysentiment2/static/HIV-4.csv'
harvard_inquirer_path = data_dir / 'HIV-4.csv'

if not harvard_inquirer_path.exists():
    urlretrieve(harvard_inquirer_url, harvard_inquirer_path)

with open(harvard_inquirer_path) as f:
    harvard_inquirer_data = list(csv.DictReader(f))
We can extract the positive and negative entries
positive_hi = [i['Entry'].lower() for i in harvard_inquirer_data if i['Positiv']]
positive_hi[:5], positive_hi[-5:], len(positive_hi)
(['abide', 'ability', 'able', 'abound', 'absolve'],
['worth-while', 'worthiness', 'worthy', 'zenith', 'zest'],
1915)
negative_hi = [i['Entry'].lower() for i in harvard_inquirer_data if i['Negativ']]
negative_hi[:5], negative_hi[-5:], len(negative_hi)
(['abandon', 'abandonment', 'abate', 'abdicate', 'abhor'],
['wrongful', 'wrought', 'yawn', 'yearn', 'yelp'],
2291)
This has fewer ties but actually a lower accuracy than human 3.
(Technically we should use stemming with Harvard Inquirer but it won’t improve matters here)
mcc_hi = MatchCountClassifier(set(positive_hi), set(negative_hi))
mcc_hi.fit(X, y)
print(f'''Harvard Inquirer
Ties: {mcc_hi.ties:0.0%}
Accuracy: {accuracy(mcc_hi.predict(X), y):0.0%}''')
Harvard Inquirer
Ties: 6%
Accuracy: 63%
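For reference, a minimal sketch of what stemmed matching could look like, using NLTK's PorterStemmer as a stand-in (my substitution; the Inquirer has its own conventions for inflected forms, so this is only an approximation):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stem both the lexicon and the review tokens before matching.
positive_hi_stems = {stemmer.stem(w) for w in positive_hi}
negative_hi_stems = {stemmer.stem(w) for w in negative_hi}

X_stemmed = [[stemmer.stem(t) for t in tokens] for tokens in X]
mcc_hi_stem = MatchCountClassifier(positive_hi_stems, negative_hi_stems)
mcc_hi_stem.fit(X_stemmed, y)
accuracy(mcc_hi_stem.predict(X_stemmed), y)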
Error Analysis
yhat = mcc3.predict(X)
scores = [mcc3._score(row) for row in X]

correct = [yi==yhati for yi, yhati in zip(y, yhat)]
mean(correct)
0.6984126984126984
incorrect_idx = [i for i, c in enumerate(correct) if not c]
len(incorrect_idx)
418
for score, count in Counter(scores[i] for i in incorrect_idx).most_common():
    print(score, '\t', count, '\t', f'{count/len(incorrect_idx):0.2%}')
0 84 20.10%
-1 67 16.03%
1 61 14.59%
-2 51 12.20%
2 37 8.85%
-3 30 7.18%
-4 16 3.83%
3 13 3.11%
-5 11 2.63%
-6 10 2.39%
4 8 1.91%
5 7 1.67%
6 6 1.44%
-9 4 0.96%
-7 4 0.96%
-10 3 0.72%
-8 2 0.48%
17 1 0.24%
-15 1 0.24%
-12 1 0.24%
-11 1 0.24%
Let's look at the most extreme cases
very_wrong_idx = [i for i in incorrect_idx if abs(scores[i]) >= 7]

len(very_wrong_idx)
17
import html

def mark_span(text, color):
    return f'<span style="background: {color};">{html.escape(text)}</span>'

def markup_html_words(text, words, color):
    word_pattern = '|'.join([r'\b' + re.escape(word) + r'\b' if len(word) > 1 else re.escape(word) for word in words])
    return re.sub(fr"({word_pattern})(?![^<]*>)", lambda match: mark_span(match.group(1), color), text, flags=re.IGNORECASE)

def markup_sentiment(text, positive=positive3, negative=negative3):
    text = markup_html_words(text, positive, "lightgreen")
    text = markup_html_words(text, negative, "orange")
    return text

from IPython.display import HTML, display

def show_index(idx):
    movieid = data['movieid'][idx]
    print(f'Movie: {movieid}, Label: {data["label"][idx]}, Score: {scores[idx]}')
    display(HTML(markup_sentiment(get_movie_html(movieid))))
This movie is labelled as negative, despite being 3 out of 4 stars.
The author is a Woody Allen fan and it wasn’t his favorite Woody Allen film but it’s still pretty good.
This is a mislabelling.
show_index(very_wrong_idx[0])
Movie: 15970, Label: neg, Score: 17
Celebrity (1998)
reviewed by
Matt Prigge
CELEBRITY (1998) A Film Review by Ted Prigge Copyright 1998 Ted Prigge
Writer/Director: Woody Allen Starring: Kenneth Branagh, Judy Davis, Joe Mantegna, Charlize Theron, Leonardo DiCaprio, Famke Janssen, Winona Ryder, Melanie Griffith, Bebe Neuwirth, Michael Lerner, Hank Azaria, Gretchen Mol, Dylan Baker, Jeffrey Wright, Greg Mottola, Andre Gregory, Saffron Burrows, Alfred Molina, Vanessa Redgrave, Joey Buttafuoco, Mary Jo Buttafuoco, Donald Trump
After hearing reviews for Woody Allen's upteenth movie in history, "Celebrity," range from terribly boring to just so-so, my heart lept when the opening images of the film closely resembled that of "Manhattan," my personal favorite from my personal favorite director of all time. Woody Allen's films almost never rely on visual flair over textual flair, so when one of his films closely resembles the one time that these two entities fit hand-in-hand ("Manhattan" really is one of the best-looking films I've ever seen, beautiful black and white photography of the city's best areas, etc.), a fan can't help but feel visibly moved. The film opens up, with the usual credits with plain white font over black backgrounds, and an old ironic standard playing on the soundtrack, but then the screen fills with a gorgeous dull gray sky, with the word "Help" being spelled with an airplane. Beethoven's 5th blasts on the soundtrack. The city seems to stop to take notice of this moment, and it's all rather lovely to look at.
And then we cut to a film crew, shooting this as the film's hilariously banal key moment in the film, where the lead actress in the film (Melanie Griffith, looking as buxom and beautiful as ever) has to realize something's wrong with her life or whatever. It's a terribly stale scene for a Woody Allen film, with the great opening shots or without, and my heart sank and I soon got used to the fact that once again, a new film of his was not going to be as great as his past works (though, for the record, last year's "Deconstructing Harry" came awfully close).
What the hell has happened to him? The man who once could be relied on for neurotic freshness in cinema has not become less funny, but his films have become less insightful and more like he tossed them together out of unfinished ideas. "Bullets Over Broadway," though wonderful, relies on irony to pull a farce that just never totally takes off. "Mighty Aphrodite" is more full of great moments and lines than a really great story. "Everyone Says I Love You" was more of a great idea than a great film. Even "Deconstructing Harry" is admittingly cheap in a way, even if it does top as one of his most truly hilarious films.
If anything, the reception of "Celebrity" by everyone should tip Allen off to the fact that this time, it's not the audience and critics who are wrong about how wonderful his film is: it's him. "Celebrity" is, yes, a good film, but it's only marginally satisfying as a Woody Allen film. Instead of creating the great Woody Allen world, he's created a world out of a subject he knows only a bit about. And he's fashioned a film that is based almost entirely on his uninformed philosophy of celebrities, so that it plays like a series of skits with minor connections. It's like "La Dolce Vita" without the accuracy, the right amount of wit, and the correct personal crisis.
Woody, becoming more insecure in his old age, choses to drop the Woody Allen character in on the world of celebrities, and then hang him and all his flaws up for scrutiny, and does this by casting not himself but Brit actor Kenneth Branagh in the lead. Much has been said about his performance - dead on but irritating, makes one yearn for the real thing, blah blah blah - but to anyone who actually knows the Woody Allen character knows that Branagh's performance, though featuring some of the same mannerisms (stuttering, whining, lots o' hand gestures), is hardly a warts-and-all impersonation. Branagh brings along with him little of the Woody Allen charm, which actually allows for his character's flaws to be more apparent. Woody's a flawed guy, and we know it, but we love him anyway, because he's really funny and really witty and really intelligent. Branagh's Allen is a bit more flat-out bad, but with the same charm so that, yes, we like him, but we're still not sure if he's really a good person or not.
His character, Lee Simon, is first seen on the set of the aforementioned movie, hits on extra actress Winona Ryder, then goes off to interview Griffith, who takes him to her childhood home where he makes a pass at her, and she denies him...sorta. We then learn, through flashbacks, that Lee has been sucked into trying to be a celebrity thanks to a mid-life crisis and an appearance at his high school reunion. He has since quit his job as a travel journalist and become a gossip journalist of sorts, covering movie sets and places where celebrities congregate, so that he can meet them, and maybe sell his script (a bank robbery movie "but with a deep personal crisis"). As such, he has divorced his wife of several years (Allen regular Judy Davis), and continues on a quest for sexual happiness, boucing from girlfriend to girlfriend and fling to fling over the course of the film.
After Griffith comes his escapades with a model (Charlize Theron) who is "polymorphously perverse" (glad to see Allen is using new jokes, ha ha), who takes him for a wild ride not different from that of the Anita Ekberg segment of "La Dolce Vita." Following are his safe relationship with smart working woman Famke Janssen, a relationship that almost assures him success, and his continued escapades with Ryder, whom he fancies most of all. His story is juxtaposed with that of Davis, who flips out, but stumbles onto happiness when she runs into a handsome, friendly TV exec (Joe Mantegna) who lands her a job that furthers her career to national status. While Lee is fumbling about, selfishly trying to ensure his own happiness, Davis becomes happy ("I've become the kind of woman I've always hated...and I'm loving it.") without doing a thing.
The result is a film of highs and mediums. The mediums are what take up most of the film, with sitations and scenes which don't exactly work but you can't help but pat Allen on the back for trying. But other places are really great scenes. The opening. The sequence with Theron, which is so good that I wished it hadn't ended. A banana scene with Bebe Neuwirth (droll as ever). And, perhaps the best sequence: a romp with hot-as-hell teen idol, Brandon Darrow, played by none other than Leo DiCaprio, who is so un-DiCaprio-esque that if any of this fans could sit through this film, they'd never look at him the same way. He ignites the screen with intensity, and spares nothing in showing his character as narcissistically tyrannical, and totally heartbreaking for Lee, who comes to him to talk about his script that he has read, and finds himself on a wild all-day ride with him. They go to Atlantic City to watch a fight, they gamble, and they wind up in his hotel room, where Darrow gets it on with his flame (Gretchen Mol) and he lends him one of the leftover groupies. Allen's writing in these scenes are so good that just for them, I'd almost recommend the film. Almost.
But what I really liked about this film is despite the fact that it's a mess, despite the fact that what this film really needs is a good old fashioned rewrite by Allen himself, it's still a smart and insightful film. Though some of the jokes are either stale or misplaced (some seem too cartoonish, even for this environment), Allen still manages to get across that this film is not exactly about celebrities, as it may seem to be (if it were, it'd be extremely out-of-touch), but about those who want to be celebrities, and how they equate celebrity-hood with happiness. We never get close enough to the actual celebrities to see if they're really happy (they may appear to be on the surface...), but we do get close enough to Lee and Davis' character. Lee is obsessed with the phenomenon, while Davis takes is at arm's length, and never gets too involved in what it is, and soon becomes one herself.
Besides, it's witty, and it does have the one thing that no other film has but Allen's: that great Woody Allen feel. It may be not exactly fresh and lively or totally brilliant in its depiction of its subject, and yes, as a part of Woody Allen's oeuvre, it's merely a blip (no "Annie Hall" but it's no "Shadows and Fog" either), but it goes to prove that no one can make a film like him, and only he and maybe Godard could possibly take a totally horrible metaphor, like the one in the beginning, and make it work not once but twice.
MY RATING (out of 4): ***
Homepage at: http://www.geocities.com/Hollywood/Hills/8335/
The review above was posted to the
rec.arts.movies.reviews newsgroup (de.rec.film.kritiken for German reviews).
The Internet Movie Database accepts no responsibility for the contents of the
review and has no editorial control. Unless stated otherwise, the copyright
belongs to the author.
Please direct comments/criticisms of the review to relevant newsgroups.
Broken URLs inthe reviews are the responsibility of the author.
The formatting of the review is likely to differ from the original due
to ASCII to HTML conversion.
Related links: index of all rec.arts.movies.reviews reviews
Interestingly, 15970 is one of the "corrections" in diff.txt: it was moved from pos to neg
diff_txt = sentiment_fh.extractfile(sentiment_fh.getmember('diff.txt')).read().decode(CODEC).rstrip()
print(diff_txt)
== Changes made ==
mix20_rand700_tokens_cleaned.zip
-> mix20_rand700_tokens_0211.tar.gz
Removed : (non-English/incomplete reviews)
pos/cv037_tok-11720.txt
pos/cv206_tok-12590.txt
pos/cv263_tok-10033.txt
pos/cv365_tok-21785.txt
pos/cv400_tok-11748.txt
pos/cv528_tok-12960.txt
pos/cv627_tok-14423.txt
neg/cv059_tok-8583.txt
neg/cv111_tok-11625.txt
neg/cv193_tok-28093.txt
neg/cv216_tok-27832.txt
neg/cv219_tok-11130.txt
neg/cv423_tok-10742.txt
neg/cv592_tok-10894.txt
Moved: (based on Nathan's judgement when he read the review,
sometimes different from the original author's own rating,
as listed below)
neg -> pos:
cv279_tok-23947.txt *1/2, but reads positive
cv346_tok-24609.txt misclassification
cv375_tok-0514.txt misclassification
cv389_tok-8969.txt misclassification
cv425_tok-8417.txt several reviews together
cv518_tok-11610.txt misclassification
pos -> neg:
cv017_tok-29801.txt *** Average, hits and misses
cv352_tok-15970.txt (out of 4): ***
cv375_tok-21437.txt * * * - Okay movie, hits and misses
cv377_tok-7572.txt *** Pretty good, bring a friend
cv546_tok-23965.txt * * * - Okay movie, hits and misses
A lot of the other examples were based on repeated punctuation (exclamation marks and question marks)
Counter(word for word in data['text'][very_wrong_idx[1]].split() if word in positive3 + negative3)
Counter({'bad': 1,
'stupid': 1,
'worst': 2,
'!': 9,
'great': 2,
'?': 6,
'still': 2})
Counter(word for word in data['text'][very_wrong_idx[2]].split() if word in positive3 + negative3)
Counter({'!': 3, '?': 6, 'bad': 1, 'great': 1})
Counter(word for word in data['text'][very_wrong_idx[3]].split() if word in positive3 + negative3)
Counter({'?': 9, 'bad': 1})
Counter(word for word in data['text'][very_wrong_idx[4]].split() if word in positive3 + negative3)
Counter({'?': 11, 'love': 2, 'bad': 1})
Machine Learning Methods
We'll now use traditional machine learning methods. To show the different methods we'll use a small vocabulary from the human baseline.
To keep this section a reasonable length we'll use sklearn implementations of the methods.
vocab = positive3 + negative3
vocab
['love',
'wonderful',
'best',
'great',
'superb',
'still',
'beautiful',
'bad',
'worst',
'stupid',
'waste',
'boring',
'?',
'!']
We'll create a feature vector from this vocabulary for each document
word_counts = [Counter(word for word in doc if word in vocab) for doc in X]

X_feature = [[row[word] for word in vocab] for row in word_counts]

dict(zip(vocab, X_feature[0]))
{'love': 0,
'wonderful': 0,
'best': 0,
'great': 1,
'superb': 0,
'still': 1,
'beautiful': 0,
'bad': 0,
'worst': 0,
'stupid': 0,
'waste': 0,
'boring': 0,
'?': 1,
'!': 0}
And split it into train and test sets by the cvid
X_train = [row for row, cvid in zip(X_feature, data['cvid']) if int(cvid) // 233 < 2]
X_test = [row for row, cvid in zip(X_feature, data['cvid']) if int(cvid) // 233 >= 2]

y_train = [row for row, cvid in zip(y, data['cvid']) if int(cvid) // 233 < 2]
y_test = [row for row, cvid in zip(y, data['cvid']) if int(cvid) // 233 >= 2]

len(X_train), len(X_test), len(y_train), len(y_test)
(921, 465, 921, 465)
Naive Bayes
The text states they use Naive Bayes with add-1 smoothing (so in sklearn alpha=1.0):
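As a reminder of what that means (my notation, not the paper's): with add-1 (Laplace) smoothing the estimated probability of word $w$ in class $c$ is

$$P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w'} \mathrm{count}(w', c) + |V|}$$

where $|V|$ is the vocabulary size, which is what alpha=1.0 gives in MultinomialNB.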
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=1.0)
nb.fit(X_train, y_train)
accuracy(nb.predict(X_test), y_test)
0.7161290322580646
There is another way to do Naive Bayes, the Bernoulli event model, where each word is treated as a binary present/absent feature, but it tends to be worse for NLP.
from sklearn.naive_bayes import BernoulliNB

nb = BernoulliNB(binarize=1.0, alpha=1.0)
nb.fit(X_train, y_train)
accuracy(nb.predict(X_test), y_test)
0.6666666666666666
nb = MultinomialNB(alpha=1.0)
Maximum Entropy
Maximum Entropy is an old NLP term for Logistic Regression
We use ten iterations of the improved iterative scaling algorithm (Della Pietra et al., 1997) for parameter training (this was a sufficient number of iterations for convergence of training-data accuracy), together with a Gaussian prior to prevent overfitting (Chen and Rosenfeld, 2000).
A Gaussian prior is equivalent to an L2 penalty, but they don't specify the size of the prior and I can't access the referenced paper. I'll stick to the sklearn default of C=1.0 (the solver shouldn't matter much).
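Roughly, the correspondence (my sketch, not from the paper) is that a zero-mean Gaussian prior with variance $\sigma^2$ on the weights turns maximum likelihood into MAP estimation,

$$\hat{w} = \arg\max_{w} \sum_i \log p(y_i \mid x_i, w) - \frac{\lVert w \rVert_2^2}{2\sigma^2}$$

which is exactly L2-penalised logistic regression; sklearn's C plays the role of the prior variance (up to scaling), so a larger C means a weaker prior.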
from sklearn.linear_model import LogisticRegression

me = LogisticRegression(penalty='l2', solver='liblinear', C=1.0)
me.fit(X_train, y_train)
accuracy(me.predict(X_test), y_test)
0.6946236559139785
Note that the amount of regularization can actually matter. Ideally we’d keep a small holdout set for hyperparameter tuning, but I’ll stick to the methods in the original.
me = LogisticRegression(penalty='l2', solver='liblinear', C=0.5)
me.fit(X_train, y_train)
accuracy(me.predict(X_test), y_test)
0.7075268817204301
me = LogisticRegression(penalty='l2', solver='liblinear', C=1.0)
Support Vector Machines
We used Joachim’s (1999) SVM light package for training and testing, with all parameters set to their default values, after first length-normalizing the document vectors, as is standard (neglecting to normalize generally hurt performance slightly).
There's little detail as to the hyperparameters again, so I'll use the defaults.
from sklearn.svm import LinearSVC
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline

svm = Pipeline([('norm', Normalizer(norm='l2')),
                ('svc', LinearSVC(penalty='l2', loss='squared_hinge', C=1.0))])
svm.fit(X_train, y_train)
accuracy(svm.predict(X_test), y_test)
0.7290322580645161
I’m not clear on what length-normalizing is, but L2 normalizing looks like it works better than L1
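As a toy illustration of what the Normalizer does (my example, not from the paper): L2 normalization scales each document vector to unit Euclidean length, so long and short documents become comparable:

from sklearn.preprocessing import Normalizer

# A single made-up two-feature document vector.
Normalizer(norm='l2').transform([[3.0, 4.0]])
# array([[0.6, 0.8]])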
svm = Pipeline([('norm', Normalizer(norm='l1')),
                ('svc', LinearSVC(penalty='l2', loss='squared_hinge', C=1.0))])
svm.fit(X_train, y_train)
accuracy(svm.predict(X_test), y_test)
0.7118279569892473
And this works better than not normalizing it at all
svm = LinearSVC(penalty='l2', loss='squared_hinge', C=1.0)
svm.fit(X_train, y_train)
accuracy(svm.predict(X_test), y_test)
/home/eross/mambaforge/envs/pang_lee_2003/lib/python3.8/site-packages/sklearn/svm/_base.py:1244: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
warnings.warn(
0.6967741935483871
As with Maximum Entropy it's sensitive to the amount of regularization.
svm = Pipeline([('norm', Normalizer(norm='l2')),
                ('svc', LinearSVC(penalty='l2', loss='squared_hinge', C=2.0))])
svm.fit(X_train, y_train)
accuracy(svm.predict(X_test), y_test)
0.7311827956989247
We’ll reset everything to the defaults
svm = Pipeline([('norm', Normalizer(norm='l2')),
                ('svc', LinearSVC(penalty='l2', loss='squared_hinge', C=1.0))])
Evaluation
Experimental Set-up
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
we randomly selected 700 positive-sentiment and 700 negative-sentiment documents
Counter(y)
Counter({0: 692, 1: 694})
We then divided this data into three equal-sized folds, maintaining balanced class distributions in each fold.
folds = [[idx for idx, cvid in enumerate(data['cvid']) if int(cvid) // 233 == i] for i in range(3)]

[len(f) for f in folds]
[459, 462, 463]
cv = [(folds[0] + folds[1], folds[2]),
      (folds[0] + folds[2], folds[1]),
      (folds[1] + folds[2], folds[0]),
     ]
One unconventional step we took was to attempt to model the potentially important contextual effect of negation: clearly “good” and “not very good” indicate opposite sentiment orientations. Adapting a technique of Das and Chen (2001), we added the tag NOT to every word between a negation word (“not”, “isn’t”, “didn’t”, etc.) and the first punctuation mark following the negation word.
To do this we can first get the negation words with a surprisingly effective heuristic:
words = Counter(word for text in X for word in text)

negation_words = [w for (w, c) in words.most_common(10_000) if re.match(".*n[o']t$", w)]
negation_words
['not',
"doesn't",
"don't",
"isn't",
"can't",
"didn't",
"wasn't",
"aren't",
"won't",
"couldn't",
"wouldn't",
'cannot',
"haven't",
"hasn't",
"weren't",
"shouldn't",
"ain't",
"hadn't"]
Then we can use a simple finite state machine to add the negation tokens.
punctuation = '!?.,()[];:,"'

def negation_mark(x):
    return 'NOT_' + x

def add_negation_tag(tokens, negation_words=negation_words, punctuation=punctuation, negation_mark=negation_mark):
    in_negation = False
    tagged_tokens = []
    for token in tokens:
        if token in negation_words:
            in_negation = not in_negation
        elif token in punctuation:
            in_negation = False
        elif in_negation:
            token = negation_mark(token)
        tagged_tokens.append(token)
    return tagged_tokens

def text_add_negation_tag(s: str, **kwargs) -> str:
    return ' '.join(add_negation_tag(tokens=s.split(), **kwargs))
"this isn't a great movie , it is terrible") text_add_negation_tag(
"this isn't NOT_a NOT_great NOT_movie , it is terrible"
For this study, we focused on features based on unigrams (with negation tagging) and bigrams. Because training MaxEnt is expensive in the number of features, we limited consideration to (1) the 16165 unigrams appearing at least four times in our 1400-document corpus (lower count cutoffs did not yield significantly different results), and (2) the 16165 bigrams occurring most often in the same data (the selected bigrams all occurred at least seven times). Note that we did not add negation tags to the bigrams, since we consider bigrams (and n-grams in general) to be an orthogonal way to incorporate context.
There's a slight issue here in using a vocabulary based on the entire dataset; the features should only be selected from the training data in each fold, otherwise you could be overfitting.
We'll be making lots of these, so we'll make a factory function to remove some of the boilerplate.
def make_count_vectorizer(
        max_features=16165,
        input='content',
        token_pattern=r"[^ ]+",
        ngram_range=(1,1),
        preprocessor=None,
        binary=True):
    return CountVectorizer(input=input,
                           token_pattern=token_pattern,
                           ngram_range=ngram_range,
                           preprocessor=preprocessor,
                           max_features=max_features,
                           binary=binary)
unigram_freq_vectorizer = make_count_vectorizer(preprocessor=text_add_negation_tag, binary=False)

unigram_vectorizer = make_count_vectorizer(preprocessor=text_add_negation_tag)

bigram_vectorizer = make_count_vectorizer(ngram_range=(2,2))
Results
From the README there’s an updated Figure 3
| | Features | # features | NB | ME | SVM |
|---|---|---|---|---|---|
| 1 | unigrams (freq.) | 16162 | 79.0 | n/a | 73.0 |
| 2 | unigrams | 16162 | 81.0 | 80.2 | 82.9 |
| 3 | unigrams+bigrams | 32324 | 80.7 | 80.7 | 82.8 |
| 4 | bigrams | 16162 | 77.3 | 77.5 | 76.5 |
| 5 | unigrams+POS | 16688 | 81.3 | 80.3 | 82.0 |
| 6 | adjectives | 2631 | 76.6 | 77.6 | 75.3 |
| 7 | top 2631 unigrams | 2631 | 80.9 | 81.3 | 81.2 |
| 8 | unigrams+position | 22407 | 80.8 | 79.8 | 81.8 |
For each feature function (row in the table) we want to apply it followed by each of the 3 types of classifiers
All results reported below, as well as the baseline results from Section 4, are the average three-fold cross-validation results on this data.
We can use cross_val_score to do the cross-validation.
models = {
    'NB': nb,
    'ME': me,
    'SVM': svm
}

results = {}

def evaluate(vectorizer):
    results = {}
    for model_name, model in models.items():
        pipeline = Pipeline(steps=[('Vectorizer', vectorizer), ('model', model)])
        cv_score = cross_val_score(pipeline, data['text'], y, cv=cv, scoring='accuracy')
        results[model_name] = cv_score.mean()
    return results
Initial Unigram Results
We can show the results to compare with the table
def display_results(results):
    print('\t'.join(results))
    print('\t'.join([f'{v:0.1%}' for v in results.values()]))

def display_results(results):
    result_html = '<table><tr>' + ''.join([f'<th>{r}</th>' for r in results]) + '</tr><tr>' + \
        ''.join([f'<td>{v:0.1%}</td>' for v in results.values()]) + '</tr></table>'
    display(HTML(result_html))
This does a little better than the table from the README, which is surprising, especially for Naive Bayes, which is deterministic with no tunable parameters.
results['unigram_freq'] = evaluate(unigram_freq_vectorizer)

display_results(results['unigram_freq'])
NB | ME | SVM |
80.3% | 80.5% | 77.0% |
Feature frequency vs. presence
As in the paper, using presence features gives a significant lift, but again all the accuracies are much higher here. In particular Maximum Entropy (Logistic Regression) is higher than Naive Bayes here and much closer to SVM, which suggests there could be poor regularisation in the original.
results['unigram'] = evaluate(unigram_vectorizer)

display_results(results['unigram'])
NB | ME | SVM |
81.9% | 84.2% | 84.4% |
Bigrams
Line (3) of the results table shows that bigram information does not improve performance beyond that of unigram presence, although adding in the bigrams does not seriously impact the results, even for Naive Bayes
This is still observed here, but again our results are slightly higher than the authors'.
results['unigram_bigram'] = evaluate(FeatureUnion([('unigram', unigram_vectorizer), ('bigram', bigram_vectorizer)]))

display_results(results['unigram_bigram'])
NB | ME | SVM |
80.7% | 83.8% | 84.4% |
We have similar observations for bigrams:
However, comparing line (4) to line (2) shows that relying just on bigrams causes accuracy to decline by as much as 5.8 percentage points.
results['bigram'] = evaluate(bigram_vectorizer)

display_results(results['bigram'])
NB | ME | SVM |
77.6% | 78.0% | 78.3% |
We see a marginally larger decline
f"{results['bigram'][k] - results['unigram'][k] :0.2%}" for k in results['bigram']} {k:
{'NB': '-4.26%', 'ME': '-6.14%', 'SVM': '-6.07%'}
Parts of speech
We also experimented with appending POS tags to every word via Oliver Mason’s Qtag program
Unfortunately I can’t access QTag, but instead will use a much more modern (and likely more accurate) Averaged Perceptron Tagger from NLTK. It’s relatively expensive so I’ll cache the calculations across models.
import nltk
from functools import lru_cache

nltk.download('averaged_perceptron_tagger')

@lru_cache(maxsize=None)
def pos_tag(doc):
    return nltk.pos_tag(doc.split())

pos_tag('The quick brown fox jumped over the lazy dog.')
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /home/eross/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
[('The', 'DT'),
('quick', 'JJ'),
('brown', 'NN'),
('fox', 'NN'),
('jumped', 'VBD'),
('over', 'IN'),
('the', 'DT'),
('lazy', 'JJ'),
('dog.', 'NN')]
We can then append the tags as follows:
def pos_marker(doc):
    return ' '.join([token + '_' + tag for token, tag in pos_tag(doc)])

pos_marker(data['text'][0]).split()[:10]
['united_JJ',
'states_NNS',
',_,',
'1998_CD',
'u_NN',
'._.',
's_NN',
'._.',
'release_NN',
'date_NN']
However, the effect of this information seems to be a wash: as depicted in line (5) of Figure 3, the accuracy improves slightly for Naive Bayes but declines for SVMs, and the performance of MaxEnt is unchanged.
We actually get worse results across the board
unigram_pos_vectorizer = make_count_vectorizer(preprocessor=pos_marker)

results['unigram_pos'] = evaluate(unigram_pos_vectorizer)
display_results(results['unigram_pos'])
NB | ME | SVM |
80.6% | 82.4% | 82.5% |
display_results(results['unigram'])
NB | ME | SVM |
81.9% | 84.2% | 84.4% |
It's not the effect of dropping negations either (in fact that makes Naive Bayes and SVM do slightly better)
unigram_no_negation_vectorizer = make_count_vectorizer(preprocessor=None)

display_results(evaluate(unigram_no_negation_vectorizer))
NB | ME | SVM |
82.2% | 82.9% | 83.3% |
Since adjectives have been a focus of previous work in sentiment detection (Hatzivassiloglou and Wiebe, 2000; Turney, 2002), we looked at the performance of using adjectives alone.
def adjective_extractor(doc):
    adjectives = [token for token, tag in pos_tag(doc) if tag == 'JJ']
    return ' '.join(adjectives)

adjective_extractor(data['text'][0])
"united wide mpaa male full-frontal female graphic sexual frequent theatrical daphne murray liber kimball wild dreary early frontal late early lucrative mtv fast-paced slick flashy mindless wild first eleven convincing much flesh narrative wild real increasingly- improbable predictable serpentine idiotic steven seagal easy unlikely right wrong good next hot young old screen i'll film's erotic impressive wild risqué soft-core generic much lesbian token iron-clad film's full kevin few fully-clothed thirteenth titanic familiar wild john last finely-tuned psychological normal copious real powerful difficult same i mainstream previous wide-release mad box-office quick pretty main wild occasional sam matt florida's blue high curvaceous van theresa daphne skeptical suzie neve similar wild isn't good much ludicrous triple-digit on-screen nice slutty see-through one-piece kevin only interesting right secret wild absurd basic kinetic wild great superficial trash"
Yet, the results, shown in line (6) of Figure 3, are relatively poor: the 2633 adjectives provide less useful information than unigram presence.
adjective_vectorizer = make_count_vectorizer(preprocessor=lambda x: adjective_extractor(text_add_negation_tag(x)),
                                             max_features=2633)

results['adjective'] = evaluate(adjective_vectorizer)
display_results(results['adjective'])
NB | ME | SVM |
77.2% | 76.2% | 76.2% |
Indeed, line (7) shows that simply using the 2633 most frequent unigrams is a better choice, yielding performance comparable to that of using (the presence of) all 16165 (line (2)).
unigram_2633_vectorizer = make_count_vectorizer(preprocessor=text_add_negation_tag, max_features=2633)

results['unigram_2633'] = evaluate(unigram_2633_vectorizer)
display_results(results['unigram_2633'])
NB | ME | SVM |
81.0% | 81.9% | 83.2% |
Position
As a rough approximation to determining this kind of structure, we tagged each word according to whether it appeared in the first quarter, last quarter, or middle half of the document.
def quartile_marker(doc):
    tokens = doc.split()
    quartile = len(tokens) // 4
    output_tokens = [token + f'_{i // quartile}' for i, token in enumerate(tokens)]
    return ' '.join(output_tokens)

quartile_marker('this is an example text with several words in it')
'this_0 is_0 an_1 example_1 text_2 with_2 several_3 words_3 in_4 it_4'
I suspect they kept the top vocabulary across the whole document, but we will generate it by quartile.
The results (line (8)) didn’t differ greatly from using unigrams alone, but more refined notions of position might be more successful.
position_vectorizer = make_count_vectorizer(preprocessor=quartile_marker, max_features=22430)

results['position'] = evaluate(position_vectorizer)
display_results(results['position'])
NB | ME | SVM |
78.7% | 79.8% | 80.1% |
Let’s see all our results
import pandas as pd

pd.DataFrame(results).T.style.format('{:0.1%}')
| | NB | ME | SVM |
|---|---|---|---|
| unigram_freq | 80.3% | 80.5% | 77.0% |
| unigram | 81.9% | 84.2% | 84.4% |
| unigram_bigram | 80.7% | 83.8% | 84.4% |
| bigram | 77.6% | 78.0% | 78.3% |
| unigram_pos | 80.6% | 82.4% | 82.5% |
| adjective | 77.2% | 76.2% | 76.2% |
| unigram_2633 | 81.0% | 81.9% | 83.2% |
| position | 78.7% | 79.8% | 80.1% |
Which is comparable to the original table
| | Features | # features | NB | ME | SVM |
|---|---|---|---|---|---|
| 1 | unigrams (freq.) | 16162 | 79.0 | n/a | 73.0 |
| 2 | unigrams | 16162 | 81.0 | 80.2 | 82.9 |
| 3 | unigrams+bigrams | 32324 | 80.7 | 80.7 | 82.8 |
| 4 | bigrams | 16162 | 77.3 | 77.5 | 76.5 |
| 5 | unigrams+POS | 16688 | 81.3 | 80.3 | 82.0 |
| 6 | adjectives | 2631 | 76.6 | 77.6 | 75.3 |
| 7 | top 2631 unigrams | 2631 | 80.9 | 81.3 | 81.2 |
| 8 | unigrams+position | 22407 | 80.8 | 79.8 | 81.8 |
Discussion
The conclusions from the original still hold up:
The results produced via machine learning techniques are quite good in comparison to the human-generated baselines discussed in Section 4. In terms of relative performance, Naive Bayes tends to do the worst and SVMs tend to do the best, although the differences aren’t very large. On the other hand, we were not able to achieve accuracies on the sentiment classification problem comparable to those reported for standard topic-based categorization, despite the several different types of features we tried. Unigram presence information turned out to be the most effective; in fact, none of the alternative features we employed provided consistently better performance once unigram presence was incorporated.
They also give this analysis:
As it turns out, a common phenomenon in the documents was a kind of “thwarted expectations” narrative, where the author sets up a deliberate contrast to earlier discussion.
We now have much more advanced methods, in particular neural methods that can take the context into account.
On compute
This paper was relatively easy to reproduce because computers are so much faster than when they wrote it, and software is so much better.
It's worth keeping in mind the change in technology in the last 20 years; the fastest supercomputer in the world at the time, the Earth Simulator, could perform just under 36 TFLOPS, about the same as 4 NVIDIA A100s that could be rented today in the cloud for under $5 an hour (AWS was just starting up around this time).
For a more grounded comparison, in 2003 Apple released the Power Mac G5, which at the time was a powerful consumer computer, but some benchmarks from 2017 show it's around 10-100 times slower than a 7th Generation Intel i7. It had 256MB of RAM, where a mid-range laptop today would have at least 8GB, about 30 times more. This means recalculating the features 9 times (once per split and per model) was a reasonable thing for us to do, but would have been very unreasonable at the time.
Software has also come a long way: Scikit-Learn, started in 2007, made all of the feature fitting trivial; at the time it would have involved plugging together different systems (likely in C and Java).
It’s interesting to think what things may be like in another 20 years.