ML 127: Encoding Text with BERT (10 pts)

What You Need

Purpose

To encode input text with BERT (Bidirectional Encoder Representations from Transformers) in a form useful for many Natural Language Processing tasks.

As detailed below, BERT learns by predicting masked words from context (the words before and after it), and by recognizing whether sentences are in the correct order. This creates an encoding system that assigns a multidimensional vector value to each word, corresponding to that word's meaning in relation to other words.

BERT reduces information written in paragraphs of natural language to numerical vectors, which can easily be used for many tasks, such as generating more text, translating languages, summarizing text, or categorizing text (sentiment analysis).
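
For a concrete picture of what this means, here is a minimal sketch (using the Hugging Face transformers library, the same one used throughout this project) that encodes one short sentence and prints the size of the resulting vectors. You'll set up a place to run code like this in the next step.
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Encode one sentence and get the vector for each token
inputs = tokenizer("Napoleon has legacy.", return_tensors="pt")
with torch.no_grad():
  outputs = model(**inputs)

# One 768-dimensional vector per token, including the [CLS] and [SEP] markers
print(outputs.last_hidden_state.shape)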

Using Google Colab

In a browser, go to
https://colab.research.google.com/
If you see a blue "Sign In" button at the top right, click it and log into a Google account.

From the menu, click File, "New notebook".

Understanding Tokenization

In LLMs, words in a sentence are converted to numerical tokens.

The tokens are then converted to embeddings, which are high-dimensional vectors, as shown below.
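
As a quick illustration (a sketch using the same 'bert-base-uncased' tokenizer loaded in the next step), a word missing from BERT's vocabulary, such as "revolutionised", is split into smaller pieces, each with its own token:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Words not in BERT's vocabulary are split into subword pieces
pieces = tokenizer.tokenize("Napoleon revolutionised military organisation.")
print(pieces)
print(tokenizer.convert_tokens_to_ids(pieces))

This is why "revolutionised" appears as two entries, "revolution" and "ised", in the word list used in the next step.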

Tokenizing Three Sentences

Execute the code below to see how three sentences are encoded.
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
text = ("Napoleon revolutionised military organisation. "
"Napoleon has legacy. "
"Legacy still lives. "
)
rep = tokenizer(text, return_tensors = "pt")

words = ("Napoleon", "revolution", "ised", "military", "organisation", ".", 
  "Napoleon", "has", "legacy", ".", "Legacy", "still", "lives", ".")

# Collect the token IDs as plain integers
tokens = []
for t in rep.input_ids[0]:
  tokens.append(int(t))

# Print each token; IDs below 150 are special markers like [CLS] and [SEP],
# so only the larger IDs are paired with an entry from the word list
print()
word_index = 0
for t in tokens:
  if int(t) < 150:
    print(t)
  else:
    print(t, "\t", words[word_index])
    word_index += 1
The word "Napoleon" appears twice, and is encoded to the same value each time, as shown below.

The start and end of the text are marked by the tokens 101 ([CLS]) and 102 ([SEP]).
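
If you'd like to double-check which token is which, the tokenizer can convert the IDs back into token strings. This short sketch assumes the rep and tokenizer variables from the code above are still defined in your notebook:
# Convert each numeric ID back into its token string;
# 101 is [CLS], 102 is [SEP], and pieces of split words start with "##"
for token_id in rep.input_ids[0]:
  print(int(token_id), "\t", tokenizer.convert_ids_to_tokens(int(token_id)))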

Masking

BERT learns how words are related by randomly "masking" words, and predicting the masked words from the words around them.

Execute the code below to see an example of masking.

from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
text = ("Napoleon revolutionised military organisation. "
"Napoleon has legacy. "
"Legacy still lives. "
)
rep = tokenizer(text, return_tensors = "pt")

words = ("Napoleon", "revolution", "ised", "military", "organisation", ".", 
  "Napoleon", "has", "legacy", ".", "Legacy", "still", "lives", ".")

tokens = []
for t in rep.input_ids[0]:
  tokens.append(int(t))

# Mask 15% of the tokens randomly, but never the [CLS] (101) or [SEP] (102) tokens
rand = torch.rand(rep.input_ids.shape)
mask_arr = (rand < 0.15) * (rep.input_ids != 101) * (rep.input_ids != 102)
selection = torch.flatten(mask_arr[0].nonzero()).tolist()
# Replace the selected tokens with 103, the [MASK] token
rep.input_ids[0, selection] = 103
after_masking = rep.input_ids

masked_tokens = []
for t in after_masking[0]:
  masked_tokens.append(int(t))

print()
print("Token", "\t", "Masked")
word_index = 0
for (t, a) in zip(tokens, masked_tokens):
  if int(t) < 150:
    print(t)
  else:  
    print(t, "\t", a, "\t", words[word_index])
    word_index += 1
The masked tokens appear in the second column of the image below. Some of them have been changed to 103, the [MASK] token, outlined in yellow.

If you don't see any masked tokens, run the code again.
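
To see the masked text in readable form, you can decode the token IDs back into a string; the hidden words appear as [MASK]. This sketch assumes the variables from the code above are still defined:
# Decode the masked token IDs back into text; masked words show as [MASK]
print(tokenizer.decode(after_masking[0]))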

Predicting a Masked Word

Execute the code below to see how BERT fills in masked words.
from transformers import pipeline
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("Artificial Intelligence [MASK] take over the world.")
You see several possible words to fill in the mask, with the highest scoring word at the top.
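
To see the candidates and their scores more compactly, you can loop over the results, as in this sketch (the exact words and scores you get may differ slightly):
# Print each suggested word with its score, highest first
for result in unmasker("Artificial Intelligence [MASK] take over the world."):
  print(round(result['score'], 3), result['token_str'])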

Sequencing Sentences

To learn how sentences are related, BERT uses sentence pairs. The second sentence in each pair is either left unchanged, or replaced by a random different sentence from the data. BERT learns to recognize whether the sentences are in the correct order or not.

As shown below, [CLS] and [SEP] tokens indicate the start and end of the first sentence, and each token is embedded in a multidimensional vector. The embedding is done by a neural net, which is trained to correctly predict masked words, and also to correctly recognize sentences that are out of order.

Execute this code to see how sentence order is encoded:
from transformers import BertTokenizer, BertForNextSentencePrediction
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
SentenceA = ("Napoleon has legacy")
SentenceB = ("Legacy still lives")
rep = tokenizer(SentenceA,SentenceB, return_tensors = "pt")

words = ["\t", "[CLS]", "Napoleon", "has", "legacy", "[SEP]", 
  "Legacy", "still", "lives", "[SEP]"]
for w in words:
  print(w, end="\t")
print()

# Print each part of the encoding: input_ids, token_type_ids, and attention_mask
for category in rep:
  print(category, end="\t")
  for r in rep[category]:
    for token in r.tolist():
      print(token, end="\t")
    print()
As shown below, the "input_ids" row shows the tokens, and the "token_type_ids" row identifies which sentence each token is in. The "attention_mask" is always 1 in this case, indicating that the model should pay attention to every token, not ignoring any of them.

Getting Movie Review Data

We'll use data from the IMDB Dataset of 50K Movie Reviews.

I made a smaller list of 100 movie reviews, just to make the project run faster.

Execute this code to download the data:

!wget https://samsclass.info/ML/proj/IMDB100a.csv

import pandas as pd
# Only display a few rows when printing a DataFrame
pd.set_option('display.max_rows', 4)

df_orig = pd.read_csv('IMDB100a.csv')
print("Original Data:")
print(df_orig)
df = df_orig.drop('sentiment', axis=1)

print()
print("Data Without Sentiment")
print(df)
As shown below, the data contains a 'review' field with a review of each movie, followed by a 'sentiment' field which is either 'positive' or 'negative'.

We'll send the "Data Without Sentiment" to a pretrained BERT model to see how well it can decide whether the review is positive or negative.

Sentiment Analysis with BERT

Execute this code to run a pretrained BERT sentiment analysis model on the data:
import torch
import numpy as np
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = BertForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
def sentiment_movie_score(movie_review):
  # Encode the review, run the model, and return a score from 1 (very negative) to 5 (very positive)
  token = tokenizer.encode(movie_review, return_tensors = 'pt')
  result = model(token)
  return int(torch.argmax(result.logits))+1

# Score each review, truncated to its first 512 characters to keep the input short
df['sentiment'] = df['review'].apply(lambda x: sentiment_movie_score(x[:512]))

print(df)
As shown below, the model outputs a 'sentiment' field which is an integer from 1 (very negative) to 5 (very positive).
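
As a quick sanity check, you can call the scoring function on short reviews of your own; the two sentences here are just made-up examples:
# Try the scoring function on two short hand-written reviews
print(sentiment_movie_score("This movie was a masterpiece, I loved every minute of it."))
print(sentiment_movie_score("A boring, predictable waste of two hours."))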

Evaluating the Model's Accuracy

We'll count the number of reviews that are erroneously classified, in two categories: positive reviews that scored below 3, and negative reviews that scored above 3. Execute this code:
num_wrong = 0
sentiment_sum = 0

# Compare each predicted 1-5 score with the original positive/negative label
for (sentiment_predicted, sentiment_correct) in zip(df['sentiment'], df_orig['sentiment']):
  if (sentiment_predicted < 3) and (sentiment_correct == "positive"):
    num_wrong += 1
    print(sentiment_predicted, sentiment_correct) 
  elif (sentiment_predicted > 3) and (sentiment_correct == "negative"):
    num_wrong += 1
    print(sentiment_predicted, sentiment_correct) 
  sentiment_sum += sentiment_predicted

print()
print("BERT got ", num_wrong, "wrong out of 100 movies.")
print()
print("Total sentiment:", sentiment_sum)

ML 127.1: Sum of Sentiments (10 pts)

The flag is covered by a green rectangle in the image below.


Posted 6-3-24
Masked word prediction added 6-4-24