Introduction to Prerequisite Learning in NLP

Vanessa Yan
4 min read · Dec 3, 2020

In this article, I will give a brief overview of prerequisite learning, document my code for reproducing existing baseline approaches, and explain how graph neural networks can be leveraged to achieve state-of-the-art unsupervised performance on this task.

Prerequisite Learning and its Motivation

In today’s world abundant with online resources, it is often challenging for a newcomer to know where to start when trying to learn about a field. The task of prerequisite learning requires machines to automatically answer the question “which topic should one learn first?” Solving this task helps us organize information, create education curricula, display relevant content in field-specific search engines, and generate Wikipedia-style surveys of various topics.

Available datasets

  • LectureBank2: contains free text of 1717 online lecture files from 60+ courses covering 5 domains: Natural Language Processing, Machine Learning, Artificial Intelligence, Deep Learning, and Information Retrieval. Prerequisite relations for each possible pair of the 322 topics have been manually annotated. (https://github.com/Yale-LILY/LectureBank)
  • The LILY lab at Yale is currently constructing a new dataset for this task and will release the dataset soon on Github.

Exploring BERT baseline approaches

I. Pretrained BERT embeddings + classifier

The most straightforward approach to prerequisite learning is to retrieve pre-trained BERT embeddings of the concept words, concatenate the source and target concept embeddings, and pass the concatenated vector as input to a Logistic Regression, Support Vector Machine, Naïve Bayes, or Random Forest classifier. The classifier should output 0 when there is no prerequisite relationship between the source and target concepts, and 1 when understanding the source concept is a prerequisite to learning the target concept.

More specifically, one could retrieve the pre-trained BERT embeddings with a few simple commands.

!pip install transformers

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

max_len = 32  # maximum number of tokens per sequence; adjust to your data

# topics: list of concept/topic strings from the dataset
input_ids = []
attention_masks = []
for topic in topics:
    encoded_dict = tokenizer.encode_plus(
        topic,
        add_special_tokens=True,   # add [CLS] and [SEP]
        truncation=True,
        max_length=max_len,
        padding='max_length',
        return_attention_mask=True,
        return_tensors='pt')
    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True)
# Put the model in "evaluation" mode
model.eval()

input_tensors = torch.cat(input_ids, dim=0)
attention_tensors = torch.cat(attention_masks, dim=0)
with torch.no_grad():
    last_hidden = model(input_tensors, attention_mask=attention_tensors)

# last_hidden[0] holds the hidden states, with shape
# (number of examples, max number of tokens in the sequence, hidden size of the BERT model)
print(last_hidden[0].shape)

# Extract only the BERT output for the [CLS] token, in preparation for classification
features = last_hidden[0][:,0,:].detach().numpy()
print(features.shape)
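
With the [CLS] features in hand, the remaining step is to concatenate the source and target concept embeddings and fit a classifier. Below is a minimal sketch using scikit-learn's LogisticRegression; concept_pairs (index pairs into features) and labels (the 0/1 annotations) are hypothetical names standing in for however you load the LectureBank annotations.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical inputs: concept_pairs is a list of (source_idx, target_idx) index pairs into
# `features`, and labels[i] is 1 if the source concept is a prerequisite of the target, else 0.
X_pairs = np.array([np.concatenate([features[s], features[t]]) for s, t in concept_pairs])
y = np.array(labels)

X_train, X_test, y_train, y_test = train_test_split(X_pairs, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out pairs

Swapping LogisticRegression for SVC, GaussianNB, or RandomForestClassifier gives the other baseline classifiers mentioned above.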

II. Fine-tuned BERT embeddings + classifier

A second approach would be fine-tuning BERT embeddings on our task-specific corpus before feeding them to a classifier. There are many existing blogs written about how to fine-tune BERT end-to-end for a sequence classification task. However, for prerequisite learning we have to use a 2-stage approach instead of an end-to-end approach to classification, since there is a necessary intermediate step of concatenating the source and target embeddings. Thus, we should fine-tune BERT embeddings for the Masked Language Modeling task instead. Knowing how to do this is helpful for other downstream tasks as well. When trying to do this for the first time, I couldn’t find any existing tutorials, so I will lay out the specific steps I took here:

  • Clone this github repo: https://github.com/huggingface/transformers
  • Make sure you understand the README file and run_mlm.py under /examples/language-modeling
  • Run the command python3 run_mlm.py --model_name_or_path bert-base-uncased --train_file [insert your own] --do_train --output_dir /tmp/test-mlm. Specify --max_seq_length if you have memory constraints.
  • Now load the fine-tuned model, and retrieve the fine-tuned embeddings for the source and target topics.
import os
import torch
from transformers import BertModel

# Load the fine-tuned weights (the path should point to the --output_dir you used)
model = BertModel.from_pretrained(os.getcwd() + '/combined-test-mlm')
model.eval()
# input_tensors: the tokenized topics from the previous section
with torch.no_grad():
    last_hidden = model(input_tensors)
features = last_hidden[0][:,0,:].detach().numpy()
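
As a side note, if you would rather stay inside a Python session than call the CLI script, roughly the same MLM fine-tuning can be sketched with the transformers Trainer API. This is only an illustrative sketch under assumed names (lecture_texts is a hypothetical list of raw text strings from your corpus), not the exact route I took above.

import torch
from transformers import (BertTokenizer, BertForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
mlm_model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# lecture_texts: hypothetical list of raw strings from the task-specific corpus
encodings = tokenizer(lecture_texts, truncation=True, max_length=128)

class TextDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings['input_ids'])
    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}

# The collator randomly masks 15% of tokens, which is what the MLM objective trains on
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir='test-mlm', num_train_epochs=3, per_device_train_batch_size=8)
trainer = Trainer(model=mlm_model, args=args, data_collator=collator,
                  train_dataset=TextDataset(encodings))
trainer.train()
trainer.save_model('test-mlm')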

Relevant neural graph architectures

In recent years, Li et al. have also attempted neural graph approaches that determine prerequisite relations in an unsupervised way (meaning gold prerequisite annotations are not provided to the model during training). The authors applied Graph Autoencoders, Variational Graph Autoencoders, and Relational-Variational Graph Autoencoders to prerequisite learning in two papers published in 2018 and 2020 (here and here). The backbone of all these models is the Graph Convolutional Network (GCN), so I will focus my attention for the rest of this article on explaining how to apply GCNs to the specific task of prerequisite learning.

Graph Autoencoders consist of a 2-layer GCN as an encoder and a simple inner product function as the decoder. Thomas Kipf and my collaborator at the Yale LILY lab, Irene Li, have each written awesome blogs explaining GCNs, available here and here. Overall, the goal of GCNs is to learn a function of features on a graph G = (V, E), where V=vertices and E=edges. To apply GCNs to our task of prerequisite learning, we need to construct the input to the model as follows:

  • Create a feature matrix X, which contains a feature description for each node. This matrix should have dimension N x D, where N = number of nodes and D = number of input features per node. One way to initialize this matrix is to use BERT embeddings of the resources and topics.
  • Create an adjacency matrix A to represent the graph structure. One possible way to construct A is to compute a TF-IDF vector for each concept, stack these vectors into a matrix, then calculate pairwise cosine similarities, as in the snippet below.
# import libraries
import nltk
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

# Define a tokenizer that stems each word
def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = []
    for item in tokens:
        stems.append(PorterStemmer().stem(item))
    return stems

# Tokenize and vectorize the text
# texts should be a list of concept strings
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(texts)

# Calculate cosine similarities on the TF-IDF matrix to obtain the adjacency matrix A
similarities = cosine_similarity(sparse.csr_matrix(tfs), dense_output=False)
A = sparse.coo_matrix(similarities)

Having specified our task-specific inputs, we can then simply run the GCN model provided in this github repo.
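
To make the architecture concrete, here is a minimal, simplified sketch of a Graph Autoencoder in PyTorch: a 2-layer GCN encoder followed by an inner-product decoder. The class and variable names (GraphAutoencoder, normalize_adj, the hidden and latent sizes) are my own illustrative choices, not the implementation in the linked repo; X and A refer to the feature and adjacency matrices constructed above.

import numpy as np
import scipy.sparse as sp
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adj(A):
    # Renormalization trick from Kipf & Welling: A_hat = D^{-1/2} (A + I) D^{-1/2}
    A = sp.coo_matrix(A) + sp.eye(A.shape[0])
    d = np.asarray(A.sum(axis=1)).flatten()
    D_inv_sqrt = sp.diags(np.power(d, -0.5))
    return torch.FloatTensor((D_inv_sqrt @ A @ D_inv_sqrt).toarray())

class GraphAutoencoder(nn.Module):
    # 2-layer GCN encoder + inner-product decoder
    def __init__(self, in_dim, hidden_dim=64, latent_dim=32):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.W2 = nn.Linear(hidden_dim, latent_dim, bias=False)

    def forward(self, X, A_hat):
        H = F.relu(A_hat @ self.W1(X))   # layer 1: ReLU(A_hat X W1)
        Z = A_hat @ self.W2(H)           # layer 2: A_hat H W2 (node embeddings)
        return torch.sigmoid(Z @ Z.t())  # decoder: reconstructed edge probabilities

# X: N x D node features (e.g. the BERT `features` matrix); A: the cosine-similarity adjacency
# A_hat = normalize_adj(A)
# model = GraphAutoencoder(in_dim=X.shape[1])
# scores = model(torch.FloatTensor(X), A_hat)  # scores[i, j] ~ strength of a link between concepts i and j

Training amounts to minimizing a reconstruction loss (e.g. binary cross-entropy between the reconstructed and input adjacency), and the reconstructed scores between concept pairs can then be thresholded to predict prerequisite links. For the full variational and relational variants, the linked repo and papers are the reference implementations.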
