import matplotlib.pyplot as plt
import torch
import torch.nn.functional as F
import random
import csv
from pathlib import Path
from collections import Counter
from tqdm.auto import trange, tqdm
Makemore Subreddits - Part 2 Multilayer Perceptron
Let’s create a Multi-Layer Perceptron language model for subreddit names.
This loosely follows part 2 of Andrej Karpathy’s excellent makemore; go and check that out first. However he used a list of US names, whereas we’re going to use subreddit names. We’ll follow the model in Bengio et al. 2003 of building a simple 2 layer MLP on a fixed length window, but instead of using words we will use characters.
This is a Jupyter notebook; you can download the notebook or view it on Kaggle.
Loading the Data
This is largely similar to Part 1 where we get the most common subreddit names from All Subreddits and Relations Between Them.
Filter to subreddits that:
- Have at least 1000 subscribers
- Are not archived
- Are safe for work
- And are not quarantined
data_path = Path('/kaggle/input/all-subreddits-and-relations-between-them')
min_subscribers = 1_000

with open(data_path / 'subreddits.csv', 'r') as f:
    names = [d['name'] for d in csv.DictReader(f)
             if int(d['subscribers'] or 0) >= min_subscribers
             and d['description']
             and d['type'] != 'archived'
             and d['nsfw'] == 'f'
             and d['quarantined'] == 'f']
len(names)
random.seed(42)
random.shuffle(names)
Because we are using a more powerful model we will make a train/val/test split at 80%/10%/10% to identify overfitting.
N = len(names)

names_train = names[:int(0.8*N)]
names_val = names[int(0.8*N):int(0.9*N)]
names_test = names[int(0.9*N):]
len(names_train), len(names_val), len(names_test)
(26876, 3359, 3360)
The names are largely human readable.
for name in names_train[:10]:
    print(name)
splunk
thenwa
soylent
factorio
christinaricci
blues
vegancheesemaking
goldredditsays
reformed
nagoya
Compile the Data
Now convert the dataset into something that the model can easily work with. First represent all the character tokens as consecutive integers. We create a special PAD_CHAR
with index 0 to represent tokens outside of the sequence.
PAD_CHAR = '.'
PAD_IDX = 0

i2s = sorted(set(''.join(names_train)))
assert PAD_CHAR not in i2s
i2s.insert(PAD_IDX, PAD_CHAR)

s2i = {s:i for i, s in enumerate(i2s)}

V = len(i2s)
print(i2s)
['.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Then we want to predict the next character given the previous block_size
characters.
For example with 3 characters of context we want to create x
that contains consecutive strings of 3 characters, and y
that contains the fourth character.
padded_name = '...abcdef.'

for *x, y in zip(*[padded_name[i:] for i in range(4)]):
    print(x, y)
['.', '.', '.'] a
['.', '.', 'a'] b
['.', 'a', 'b'] c
['a', 'b', 'c'] d
['b', 'c', 'd'] e
['c', 'd', 'e'] f
['d', 'e', 'f'] .
We then apply this process to each of our data splits.
block_size = 3

def compile_dataset(names, block_size, PAD_CHAR=PAD_CHAR, s2i=s2i):
    X, y = [], []
    for name in names:
        padded_name = PAD_CHAR * block_size + name + PAD_CHAR
        padded_tokens = [s2i[c] for c in padded_name]
        for *context, target in zip(*[padded_tokens[i:] for i in range(block_size+1)]):
            X.append(context)
            y.append(target)
    return torch.tensor(X), torch.tensor(y)
# Note about vocab
X, y = compile_dataset(names_train, block_size)
X_val, y_val = compile_dataset(names_val, block_size)
X_test, y_test = compile_dataset(names_test, block_size)
X.shape, y.shape
(torch.Size([330143, 3]), torch.Size([330143]))
Then the task is to predict the next token given the context, where . in X represents positions before the start of a string, and in y represents the end of a string.
print(names_train[:3])
print()
for xi, yi in zip(X[:20], y[:20]):
    print([i2s[t] for t in xi], i2s[yi])
['splunk', 'thenwa', 'soylent']
['.', '.', '.'] s
['.', '.', 's'] p
['.', 's', 'p'] l
['s', 'p', 'l'] u
['p', 'l', 'u'] n
['l', 'u', 'n'] k
['u', 'n', 'k'] .
['.', '.', '.'] t
['.', '.', 't'] h
['.', 't', 'h'] e
['t', 'h', 'e'] n
['h', 'e', 'n'] w
['e', 'n', 'w'] a
['n', 'w', 'a'] .
['.', '.', '.'] s
['.', '.', 's'] o
['.', 's', 'o'] y
['s', 'o', 'y'] l
['o', 'y', 'l'] e
['y', 'l', 'e'] n
We can also look at this in terms of the underlying numeric representation.
torch.cat([X[:20], y[:20].unsqueeze(1)], axis=1)
tensor([[ 0, 0, 0, 30],
[ 0, 0, 30, 27],
[ 0, 30, 27, 23],
[30, 27, 23, 32],
[27, 23, 32, 25],
[23, 32, 25, 22],
[32, 25, 22, 0],
[ 0, 0, 0, 31],
[ 0, 0, 31, 19],
[ 0, 31, 19, 16],
[31, 19, 16, 25],
[19, 16, 25, 34],
[16, 25, 34, 12],
[25, 34, 12, 0],
[ 0, 0, 0, 30],
[ 0, 0, 30, 26],
[ 0, 30, 26, 36],
[30, 26, 36, 23],
[26, 36, 23, 16],
[36, 23, 16, 25]])
Multi-Layer Perceptron
We will follow the basic model of Bengio et al. (2003).
- Each token is embedded by a \(\left| V \right| \times m\) dimensional matrix C
- The consecutive tokens are then concatenated together across the block size n to get \(x = (C(w_{t-1}), C(w_{t-2}), \cdots, C(w_{t-n+1}))\)
- This is passed through two hidden layers, with an optional skip connection: \(y = b + Wx + U \tanh(d + Hx)\)
We’ll ignore the skip connection for now and build the basic model.
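For reference, here is a minimal sketch of what the full forward pass would look like with the skip connection included; the direct matrix W from the concatenated embeddings to the logits is hypothetical and is not used anywhere else in this notebook.

# Sketch only: Bengio-style forward pass with the skip connection W x included.
# Shapes follow the conventions used below: C is (V, m), H is (n*m, h),
# U is (h, V), and the hypothetical W is (n*m, V).
def bengio_forward(X_batch, C, H, d, U, b, W):
    x = C[X_batch].reshape(len(X_batch), -1)       # concatenated embeddings, (B, n*m)
    return b + x @ W + torch.tanh(d + x @ H) @ U   # y = b + Wx + U tanh(d + Hx)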
# Character embedding dimension
m = 2
# Hidden layer dimension
h = 100

# Word embedding layer
C = torch.randn(V, m)

# First hidden layer
H = torch.randn(block_size * m, h)
d = torch.randn(h)

# Second hidden layer
U = torch.randn(h, V)
b = torch.randn(V)

parameters = [C, H, d, U, b]

[p.numel() for p in parameters], [V*m, block_size*m*h, h, h*V, V]
([76, 600, 100, 3800, 38], [76, 600, 100, 3800, 38])
for p in parameters:
    p.requires_grad = True
Let’s go through the model step by step with a small minibatch of size B
(here 4)
batch_size = 4

idx = list(range(batch_size))
idx
[0, 1, 2, 3]
First we get the features and targets:
X[idx], y[idx]
(tensor([[ 0, 0, 0],
[ 0, 0, 30],
[ 0, 30, 27],
[30, 27, 23]]),
tensor([30, 27, 23, 32]))
We use the embedding matrix to get the corresponding \(B \times n \times m\) token embedding matrix where here n = block_size = 3
embeddings = C[X[idx]]
embeddings.shape
torch.Size([4, 3, 2])
We want to then map it to a hidden layer, so we mix together the dimensions using an affine transformation.
Note that using view we can consider the embeddings as a rank-2 tensor of dimension \(B \times (m \times n)\). We then get as output a \(B \times h\) matrix, where here h=100 is the hidden size.
hidden_layer = embeddings.view(batch_size, block_size * m) @ H + d
hidden_layer.shape
torch.Size([4, 100])
We then pass it through the tanh
activation function, and another affine transformation to get the logits.
output_logits = torch.tanh(hidden_layer) @ U + b
output_logits.shape
torch.Size([4, 38])
Finally performing a softmax gets the probabilities.
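In symbols, for a row of logits \(z\) the softmax is

\[ p_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \]

which is exactly the exponentiate-then-normalise computation below.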
output_probs = output_logits.exp()
output_probs /= output_probs.sum(axis=1, keepdim=True)

output_probs
tensor([[4.7666e-01, 1.2071e-04, 1.8991e-02, 7.3004e-08, 1.1990e-08, 5.2684e-05,
1.3352e-05, 3.5838e-06, 3.7178e-15, 1.6968e-11, 1.4544e-09, 2.2657e-04,
4.4790e-10, 1.7938e-09, 4.2694e-01, 1.4668e-07, 2.2485e-05, 1.0456e-09,
2.1118e-03, 1.3098e-08, 1.9163e-06, 9.6630e-10, 2.3439e-04, 1.4318e-02,
7.5780e-03, 6.0088e-04, 1.6348e-02, 1.6981e-05, 1.0376e-05, 2.1339e-05,
3.0579e-02, 4.3965e-04, 4.7099e-03, 2.4874e-11, 1.2188e-10, 6.3377e-08,
1.1395e-10, 3.1587e-14],
[1.1188e-09, 1.2986e-04, 1.9391e-03, 7.5456e-09, 1.1886e-10, 6.7563e-13,
5.2620e-06, 6.4061e-14, 3.2493e-09, 1.7278e-06, 3.8212e-09, 2.2627e-05,
2.7201e-16, 3.4729e-18, 3.2816e-03, 6.6333e-11, 3.6413e-10, 9.9206e-01,
7.0343e-11, 2.7351e-04, 1.2310e-12, 3.1674e-11, 3.2047e-08, 2.2359e-03,
6.8000e-10, 3.7111e-08, 1.6103e-05, 1.5626e-08, 1.3752e-06, 1.9135e-05,
5.5192e-12, 2.6339e-09, 1.5040e-05, 4.9836e-07, 8.6381e-17, 3.9047e-10,
3.8108e-13, 5.8600e-08],
[4.4304e-09, 1.0459e-10, 6.1763e-08, 6.5661e-12, 1.0469e-08, 1.5785e-07,
3.9861e-15, 1.0475e-12, 1.3482e-14, 9.6559e-17, 1.9139e-14, 1.5562e-08,
6.3912e-16, 1.0933e-12, 1.3146e-07, 6.4761e-15, 6.0490e-07, 6.5004e-10,
1.5227e-10, 5.7100e-07, 1.9665e-03, 1.2887e-19, 2.1500e-06, 5.8651e-05,
6.3402e-12, 4.1957e-05, 9.9793e-01, 7.1540e-07, 5.7883e-11, 1.9960e-07,
8.7884e-07, 7.9620e-08, 3.1533e-10, 1.0810e-15, 1.4883e-15, 7.5102e-08,
1.7920e-15, 1.2474e-14],
[1.5047e-06, 8.9029e-05, 1.5423e-06, 1.3484e-12, 5.4582e-16, 2.9394e-09,
5.0550e-11, 3.4188e-13, 2.3132e-16, 1.4139e-13, 8.3162e-18, 2.3549e-10,
3.5227e-17, 2.5157e-13, 7.2055e-05, 1.2128e-11, 3.1486e-12, 2.8336e-09,
1.3514e-06, 1.2729e-11, 8.4117e-09, 1.2549e-15, 1.8163e-15, 3.9356e-05,
2.0303e-10, 1.2476e-10, 9.9980e-01, 6.4277e-10, 3.8363e-17, 3.2836e-16,
5.7412e-15, 4.6959e-11, 4.9360e-11, 4.2413e-08, 9.7805e-14, 1.9071e-15,
6.9937e-14, 5.2302e-10]], grad_fn=<DivBackward0>)
Then we lookup the probability of the true indices.
y[idx]
tensor([30, 27, 23, 32])
pred_prob = output_probs[torch.arange(len(idx)), y[idx]]
pred_prob
tensor([3.0579e-02, 1.5626e-08, 5.8651e-05, 4.9360e-11],
grad_fn=<IndexBackward0>)
The loss is then the mean negative log likelihood.
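In symbols, over a batch of size \(B\) this is

\[ \text{loss} = -\frac{1}{B} \sum_{i=1}^{B} \log p_i(y_i) \]

where \(p_i(y_i)\) is the probability the model assigned to the true next token \(y_i\).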
loss = -pred_prob.log().mean()
loss
tensor(13.7344, grad_fn=<NegBackward0>)
Though this can be done more efficiently and simply using PyTorch’s cross_entropy function.
loss = F.cross_entropy(input=output_logits, target=y[idx])
loss
tensor(13.7344, grad_fn=<NllLossBackward0>)
Then we can run back-propagation and make a gradient descent step.
loss.backward()
lr = 0.01

for p in parameters:
    p.data -= lr * p.grad
Running it again the loss is a little lower.
embeddings = C[X[idx]]
hidden_layer = embeddings.view(batch_size, block_size * m) @ H + d
output_logits = torch.tanh(hidden_layer) @ U + b
loss = F.cross_entropy(input=output_logits, target=y[idx])
loss.item()
12.559901237487793
Training the model
Let’s put this all together to make it easy to experiment with different MLPs.
It’s worth commenting on the norm function; it returns the mean squared value of the weight matrices (C, H and U), which is used for the weight decay penalty.
class MLP:
    def __init__(self, m=2, h=100, V=V, block_size=block_size):
        self.m = m
        self.h = h
        self.V = V
        self.block_size = block_size

        # Word embedding layer
        self.C = torch.randn(V, m)
        # First hidden layer
        self.H = torch.randn(block_size * m, h)
        self.d = torch.randn(h)
        # Second hidden layer
        self.U = torch.randn(h, V)
        self.b = torch.randn(V)

    def norm(self, params=None):
        if params is None:
            params = [self.C, self.H, self.U]
        return sum((p**2).sum() for p in params) / sum(p.numel() for p in params)

    def parameters(self):
        return [self.C, self.H, self.d, self.U, self.b]

    def requires_grad_(self, requires_grad=True):
        for p in self.parameters():
            p.requires_grad_(requires_grad)
        return self

    def zero_grad(self):
        for p in self.parameters():
            p.grad = None
        return self

    def forward(self, X):
        embeddings = self.C[X]
        hidden_layer = embeddings.view(X.shape[0], self.block_size * self.m) @ self.H + self.d
        output_logits = torch.tanh(hidden_layer) @ self.U + self.b
        return output_logits

    def __call__(self, X):
        return self.forward(X)
Note that our initialisation isn’t very good. We’re getting much worse loss than a random guess.
F.cross_entropy(input=MLP(20, 100)(X), target=y), -torch.log(torch.tensor(1/V)), F.cross_entropy(input=torch.ones(len(y), V) / V, target=y)
(tensor(21.3535), tensor(3.6376), tensor(3.6376))
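One common remedy, sketched here with an assumed rescaling helper (it is not applied anywhere else in this notebook), is to shrink the initial output layer so the starting logits are close to zero and the initial loss is close to \(-\log(1/V)\):

# Sketch only: rescale the output layer of a freshly initialised MLP so its
# initial logits are near zero. init_small_output is a hypothetical helper.
def init_small_output(mlp, scale=0.01):
    with torch.no_grad():
        mlp.U *= scale
        mlp.b *= 0.0
    return mlp

F.cross_entropy(input=init_small_output(MLP(20, 100))(X), target=y)  # should be close to -log(1/V)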
The MLP can be trained with a basic training loop that returns the losses at each step, and the validation losses at several steps (it’s too expensive to calculate every step).
def train(model, n_step, lr, wd=0.0, batch_size=32, val_step=100, X=X, y=y, X_val=X_val, y_val=y_val):
    losses, val_losses = [], []

    for step in trange(n_step):
        idx = torch.randint(0, len(X), (batch_size,))

        model.zero_grad()
        logits = model(X[idx])
        # Minibatch cross-entropy plus the weight decay penalty
        loss = F.cross_entropy(input=logits, target=y[idx]) + wd * model.norm()

        losses.append((step, loss.item()))

        loss.backward()
        for p in model.parameters():
            p.data -= p.grad * lr(step, n_step)

        if step % val_step == 0:
            val_loss = F.cross_entropy(model(X_val), y_val).item()
            val_losses.append((step, val_loss))

    return losses, val_losses
To make sure this works let’s take a small slice of the dataset with no weight decay and make sure it’s overfitting with a large number of parameters.
The final loss is quite small.
mlp = MLP(30, 200).requires_grad_()

limit = 200

losses, val_losses = train(mlp, 10_000, lr=lambda _step, _n_step: 0.1, X=X[:limit], y=y[:limit])

plt.plot(*zip(*losses))
plt.yscale('log')
F.cross_entropy(mlp(X[:limit]), y[:limit]).item(), F.cross_entropy(mlp(X_val), y_val).item()
(0.338957816362381, 20.37681770324707)
Look at the underlying data
training_data = ''.join([i2s[a] for a in y[:limit]]).split('.')
for d in training_data:
    print(d)
splunk
thenwa
soylent
factorio
christinaricci
blues
vegancheesemaking
goldredditsays
reformed
nagoya
accidentalcosplay
tearsofthekingdom
jameswebb
blazblue
emilyatack
infjmemes
sarahstephens
bengalila
Random samples are almost entirely from that data
def sample(mlp, pad_idx=PAD_IDX, block_size=block_size, i2s=i2s, generator=None):
    ans = []
    state = torch.tensor([[pad_idx] * block_size])
    while True:
        probs = mlp(state).softmax(axis=1)
        next_idx = torch.multinomial(probs, 1, generator=generator)
        state = torch.concat([state, next_idx], axis=1)[:,1:]

        next_idx = next_idx[0,0].item()
        if next_idx == pad_idx:
            return ''.join(ans)
        ans.append(i2s[next_idx])
g = torch.Generator().manual_seed(241738025913704212)
samples = [sample(mlp, generator=g) for _ in range(20)]
for s in sorted(samples):
    print(s)
accident
accident
accidentalcosplay
bengalilaro9y8_5g6
bengalilaro9yrjecdc
bengalilaro9yrjecdclwenin4ozyylentalcosplunk
bengalilaro9yrjecdcqt1e4e1
bengalilaroljsemakingdom
blues
blues
christinaricci
christinaricci
christinariccident
factorio
james
james
sarahstephens
splay
tearsoftheking
thens
In fact most of the sampled quadruplets (three characters of context plus the next character) appear in the training data, as you would expect in an overfit model.
chars = []
for d in training_data:
    padded_tokens = '.' * block_size + d + '.'
    chars += list(map(''.join, zip(*[padded_tokens[i:] for i in range(block_size+1)])))

sample_chars = []
for d in samples:
    padded_tokens = '.' * block_size + d + '.'
    sample_chars += list(map(''.join, zip(*[padded_tokens[i:] for i in range(block_size+1)])))
f'{sum(s in chars for s in sample_chars) / len(sample_chars):0.2%}'
'76.61%'
Let’s now train a small model; unlike Bengio we don’t need to worry about parallelism - it runs fast enough on a CPU.
This small MLP is competitive with the bigram model that got a loss of 2.73.
mlp = MLP(2, 10).requires_grad_()

losses, val_losses = train(mlp, 100_000, lambda _step, _n_step: 0.1)

plt.plot(*zip(*losses))
plt.plot(*zip(*val_losses))
plt.yscale('log')

F.cross_entropy(mlp(X_val), y_val).item()
2.7286570072174072
Note that it only has a fraction of the parameters of the bigram model
f'{sum([len(p) for p in mlp.parameters()]) / (V*V):0.2%} of the parameters of the bigram model'
'7.06% of the parameters of the bigram model'
Because we chose a 2 dimensional embedding we can plot the characters.
There are some features that make sense:
- The vowels are in a cluster together
- The numbers form clusters
emb = mlp.C.detach()

fig, ax = plt.subplots()
ax.scatter(*zip(*emb), s=200)

for i, txt in enumerate(i2s):
    ax.annotate(txt, emb[i], horizontalalignment='center', verticalalignment='center', color='white')
We can sample from this model; it’s a little English-y but mostly noise.
g = torch.Generator().manual_seed(241738025913704212)
for _ in range(20):
    print(sample(mlp, generator=g))
aulaxbunale
nomawkby
chirlarhanosedescus
icles
ylorfruth
gakicsty
tinarcicy
xhamgecrusklabkios
nantyel
rodcox
spare
perwarapu
actesamil
8agan
akinzimedht
artnindams
bolp
p_na_wfizeycorfedpeplyirgarkmeyekbelipamubitwera
btspchupcilra
raramis
We could also scale up to a larger model.
We can increase the learning rate exponentially over the training steps to find a good learning rate. Here 0.1 is reasonable because we want a value somewhat before the point where the loss starts to diverge.
def lr_finder(step, N_step):
    return 10**(-5 + 8*step/N_step)

mlp = MLP(30, 200).requires_grad_()

losses, val_losses = train(mlp, 1000, lr_finder)

plt.plot([lr_finder(l[0], 1000) for l in losses], [l[1] for l in losses])
plt.ylim((1, 10**5))
plt.xscale('log')
plt.yscale('log')
We can then train it and get an even better result.
mlp = MLP(30, 200).requires_grad_()

losses, val_losses = train(mlp, 100_000, lambda _step, _n_step: 0.1)

plt.plot(*zip(*losses))
plt.plot(*zip(*val_losses))
plt.yscale('log')

F.cross_entropy(mlp(X_val), y_val).item()
2.568157196044922
The results are still not great but there are little pieces of real words in some of them.
g = torch.Generator().manual_seed(241738025913704212)
for _ in range(20):
    print(sample(mlp, generator=g))
ancanburgee
nimag
bybost
carencosleencus
inles
natsfrush
gatics
yatinarching
games
ruskncheroa
nahenews
fucofba
fates
awthaposactionmistragan
akingifwatt
artugn
ami
battor_namainternerfaipermains
famentibelgpamatiowar
antspching
What next?
This showed that even these small MLPs, with only three characters of context, can beat the bigram model. However the bigram model was easy to optimise (by counting!) whereas these models are a lot trickier to train. I don’t know whether the models trained here are close to the global optimum, and I would like to understand this in more detail.
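One simple check, sketched below but not run here, would be to retrain the same architecture from a few different seeds and compare validation losses; a wide spread would suggest the optimisation is landing in quite different local optima.

# Sketch: retrain from several random seeds and compare validation losses.
val_results = []
for seed in [0, 1, 2]:
    torch.manual_seed(seed)
    model = MLP(30, 200).requires_grad_()
    train(model, 100_000, lambda _step, _n_step: 0.1)
    val_results.append(F.cross_entropy(model(X_val), y_val).item())
print(val_results)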