How to visualize and search conference proceedings with Natural Language Processing models (NLPs)
Building this simple Python tool based on HuggingFace transformers, scikit-learn, pandas, and plotly will change the way you search for papers
One of the largest and most prestigious AI conferences, the 36th Conference on Neural Information Processing Systems (NeurIPS), wrapped up last week. I was fortunate to attend in person and even present two papers there. But conferences of this size can be overwhelming: thousands of accepted papers cover all kinds of topics, while you are typically only interested in a few of them, and finding those by skimming titles or abstracts alone is hard. But this year there was a change!
I noticed that they turned last year’s demo of paper visualization/search into a full feature. Instead of digging through abstracts, you get a nice map of papers, clustered by similarity, which lets you explore the neighborhood of a specific query or paper. This makes discovering interesting or related papers so much easier! Of course, I only found out about this feature after I returned home …
Nevertheless, I was wondering how complicated it would be to reproduce this, and it turns out that, thanks to all the open-source libraries and models out there, it is super simple to implement a demo like that!
In this article, I will show you how to write simple Python code that uses natural language processing models (NLPs) from the transformers library, in combination with the scikit-learn, umap-learn, pandas, and plotly libraries, to replicate this demo. We will work with papers from the NeurIPS 2021 proceedings, as the 2022 proceedings are not publicly available yet, but it should be straightforward to adjust the code once they are released. Keep in mind, though, that you aren’t limited to papers: you can also modify the code to work on any other kind of text, e.g. tweets or news articles.
The working principle is simple: we convert text (e.g. paper abstracts) into an embedding, i.e. a vector in some high-dimensional space, using a pre-trained NLP model. Texts with similar meanings should end up close together, while texts with different meanings should be further apart. Then, to find papers related to some input paper, we just measure distances and pick the closest neighbors in the embedding space. To implement search for new abstracts, we use the same model to generate an embedding for our query and apply the same neighbor lookup. Lastly, if we just want to gaze at the landscape of all papers (because data is beautiful!) and find clusters of “hot topics”, we can use a dimensionality reduction technique to project our high-dimensional data to 2D and simply plot it with a scatter plot.
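To make this concrete, here is a toy illustration with made-up 2D vectors (real embeddings have many more dimensions, but the idea is exactly the same):
import numpy as np

# Made-up 2D "embeddings", purely for illustration:
paper_a = np.array([0.9, 0.1])  # e.g. a computer vision paper
paper_b = np.array([0.8, 0.2])  # another vision paper, close to paper_a
paper_c = np.array([0.1, 0.9])  # e.g. a game theory paper, far away

print(np.linalg.norm(paper_a - paper_b))  # small distance -> related
print(np.linalg.norm(paper_a - paper_c))  # large distance -> unrelated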
So let’s go ahead and code it! I am working with a free Google Colab GPU instance. In theory, you should also be able to replicate this on CPU, but it vastly increases the processing time.
Installing dependencies
We will need some external dependencies: the transformers library, which provides access to various pre-trained models hosted at HuggingFace, and umap-learn, which implements the UMAP algorithm (alternatively, use t-SNE from scikit-learn; more on that later). Note that if you execute this code on a different runtime than Google Colab, you may have to install additional dependencies.
!pip install -q transformers
!pip install -q umap-learn
Getting the data
The NeurIPS proceedings contain all the accepted papers for every year. Every paper entry consists of a title, authors, and abstract. For convenience, you can download all this information for the 2021 proceedings from Kaggle: https://www.kaggle.com/datasets/paulgavrikov/neurips-2021-abstracts
Make sure you download 2021_neurips_abstracts.json and place it in your script path. Then, simply load it with pandas:
import pandas as pd
df = pd.read_json("2021_neurips_abstracts.json")
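It is worth a quick sanity check that the data loaded as expected (the Kaggle dataset contains title, authors, and abstract columns, which we will rely on below):
print(f"Loaded {len(df)} papers")
df.head()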
Generating Text Embeddings
To be able to compare text, we would like to map (embed) it to a vector in some high-dimensional space in a way that preserves the semantics: text with a similar meaning is mapped to vectors close to each other, while unrelated text is spaced far apart. NLP models can do exactly that, and one example you may be familiar with is GPT-3 (though that one is not free to use). Still, there are many publicly available pre-trained models that are free to use. For this demo, we will use Sentence-BERT (all-MiniLM-L6-v2), but you should be able to replace it with any other text embedding model. Sentence-BERT embeds entire blocks of text into 384-dimensional vectors.
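As an aside: if you prefer a higher-level API, the sentence-transformers package wraps this same model behind a one-liner. In this article we stick to the plain transformers approach, which makes the pooling explicit, but a minimal sketch of the alternative looks like this:
# Alternative API (requires: pip install sentence-transformers)
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("all-MiniLM-L6-v2")
vec = st_model.encode("An example abstract about deep learning.")
print(vec.shape)  # (384,)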
The following code is taken from the official repo and only slightly adjusted. It tokenizes the input, generates embeddings for the tokens, and then applies pooling plus normalization to the output. But you can skip the details and simply call the get_embedding function to get an embedding for any input text.
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2").to(device)

def get_embedding(inp_text: str):
    # Tokenize sentences
    encoded_input = tokenizer([inp_text], padding=True, truncation=True, return_tensors="pt").to(device)
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Perform pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
    # Normalize embeddings to unit length
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings[0].detach().cpu().numpy()
Except for the pooling, this code should work with any text embedding model on HuggingFace. It will use a CUDA GPU if available (recommended) or fall back to the CPU. If you are running this for the first time, the library will download the pre-trained model. Be careful if you experiment with other models, as some only produce word embeddings (i.e. one vector per token), which makes it hard to compare texts of varying lengths.
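A quick test shows that the embeddings behave as expected. Since they are normalized, the dot product equals the cosine similarity (the exact values may vary slightly on your setup, but the ordering should hold):
import numpy as np

a = get_embedding("Convolutional neural networks for image classification")
b = get_embedding("CNNs applied to visual recognition tasks")
c = get_embedding("A study of tax regulations for small businesses")

print(a.shape)       # (384,)
print(np.dot(a, b))  # higher similarity: related topics
print(np.dot(a, c))  # lower similarity: unrelated topics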
Now, let’s embed all the paper metadata we previously obtained. Since both the abstract and the title may contain important information, we will concatenate both into a single string to generate the embedding.
from tqdm import tqdm

embeddings = []
for index, row in tqdm(df.iterrows(), total=len(df)):
    input_text = f"{row.title}. {row.abstract}"
    embedding = get_embedding(input_text)
    embeddings.append(embedding)

df["embedding"] = embeddings
This should take less than a minute on a Colab GPU instance. Once completed, we have successfully mapped all paper titles and abstracts to 384D vectors. We can now go ahead and build a search engine and visualize the paper landscape based on these embeddings.
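Optionally, since computing the embeddings is the slowest step, you may want to cache them to disk so a notebook restart does not force a recompute. One simple way (the file name here is my own choice):
import numpy as np

np.save("embeddings.npy", np.stack(embeddings))
# After a restart, restore them with:
# embeddings = list(np.load("embeddings.npy"))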
Searching for similar papers
Let’s go ahead and build a simple search engine that, given a paper, finds the top 5 most related papers. This boils down to finding the 5 papers with the minimum distance in embedding space (k-nearest neighbors). There are multiple ways of measuring distance, so we will just stick to scikit-learn’s default: the Minkowski distance with p=2, i.e. the Euclidean distance. Since our embeddings are normalized to unit length, this ranks neighbors the same way cosine similarity would. To do this efficiently, we can use scikit-learn’s NearestNeighbors class, which builds up an index for efficient queries.
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=5).fit(embeddings)
Now let’s assume we want to find papers that are similar to the first paper in our database. Then we can query as follows:
distances, indices = nbrs.kneighbors(df.iloc[0].embedding.reshape(1, -1))
proximity_matches = df.iloc[indices[0]].copy()  # copy to avoid pandas' SettingWithCopyWarning
proximity_matches["distance"] = distances[0]
proximity_matches
This returns a DataFrame slice with the top-5 closest papers in embedding space (including the query paper itself, at distance 0). Easy, right?
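If you plan to run many queries, it can be convenient to wrap this logic in a small helper. This find_similar function is my own addition, not part of the original demo:
def find_similar(embedding, k=5):
    # Query the index for the k nearest papers and attach the distances.
    distances, indices = nbrs.kneighbors(embedding.reshape(1, -1), n_neighbors=k)
    matches = df.iloc[indices[0]].copy()
    matches["distance"] = distances[0]
    return matches

find_similar(df.iloc[0].embedding)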
But what if we want to find papers similar to a paper that is not in our dataset? For example, I wrote a paper called CNN Filter DB, which was published at a different conference (CVPR), and I would like to see if there are related papers at NeurIPS. That is also possible! We just compute an embedding from the title and abstract, as before, and run the search again:
search = "CNN Filter DB. Currently, many theoretical as well as practically relevant questions towards the transferability and robustness of Convolutional Neural Networks (CNNs) remain unsolved. While ongoing research efforts are engaging these problems from various angles, in most computer vision related cases these approaches can be generalized to investigations of the effects of distribution shifts in image data. In this context, we propose to study the shifts in the learned weights of trained CNN models. Here we focus on the properties of the distributions of dominantly used 3x3 convolution filter kernels. We collected and publicly provide a data set with over 1.4 billion filters from hundreds of trained CNNs, using a wide range of data sets, architectures, and vision tasks. In a first use case of the proposed data set, we can show highly relevant properties of many publicly available pre-trained models for practical applications: I) We analyze distribution shifts (or the lack thereof) between trained filters along different axes of meta-parameters, like visual category of the data set, task, architecture, or layer depth. Based on these results, we conclude that model pre-training can succeed on arbitrary data sets if they meet size and variance conditions. II) We show that many pre-trained models contain degenerated filters which make them less robust and less suitable for fine-tuning on target applications."
distances, indices = nbrs.kneighbors(get_embedding(search).reshape(1, -1))
proximity_matches = df.iloc[indices[0]].copy()
proximity_matches["distance"] = distances[0]
proximity_matches
Indeed, we get good results. Pay attention to the last match (Deeply Shared Filter Bases for Parameter-Efficient Convolutional Neural Networks). In CNN Filter DB, we also compared filter bases, so the suggested paper is highly relevant to us. Surprisingly, my abstract never mentions “filter bases”, yet the embedding still seems to capture the semantic similarity. Isn’t that amazing?
Visualizing the Embeddings
The embeddings we have generated live in 384D space, which is hard to imagine and even harder to visualize. To show the papers and their relationships in a simple scatter plot, we need to reduce the dimensionality, so let’s project the embeddings from 384D down to 2D. There are multiple ways of achieving this. For example, you may be familiar with PCA for dimensionality reduction. However, PCA is not good at preserving local structure (clusters), and that is exactly what we want to retain in 2D. As a trade-off, we will lose global structure, which is less important for this task. A common method that preserves local structure is t-SNE, but UMAP is a more modern alternative: the two are conceptually similar, but UMAP is more efficient on larger sample sizes. Note that UMAP requires an external library (installed above). Please feel free to replace it and experiment with any other dimensionality reduction technique.
from umap import UMAP
manifold = UMAP(n_components=2, init="random", random_state=0)
projections = manifold.fit_transform(embeddings)
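If you prefer to avoid the extra dependency, t-SNE from scikit-learn is, as mentioned above, a reasonable drop-in alternative (expect somewhat longer runtimes and a different-looking layout):
from sklearn.manifold import TSNE
import numpy as np

# Same interface as UMAP above: 384D in, 2D out.
manifold = TSNE(n_components=2, init="random", random_state=0)
projections = manifold.fit_transform(np.stack(embeddings))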
Success! Our 384D embeddings are now projected to 2D. We can go ahead and plot them. We could use the good ol’ matplotlib, but plotly lets us create an interactive plot in notebooks, so let’s use that.
import plotly.express as px

fig = px.scatter(
    df, x=projections[:, 0], y=projections[:, 1],
    hover_name="title", hover_data=["authors"],
    width=1000, height=1000
)
fig.show()
And that’s it! Now hover over the plot and explore the paper landscape. You should see certain clusters that cover similar topics, e.g. robustness. In many cases the clustered papers have similarly worded titles, but you should also notice that differently worded, yet semantically similar, papers form clusters. In the following picture, I have tried to highlight some clusters that I noticed (educated guesses). You will also see a large patch of non-clustered papers, and that is good! It means these papers sit in between and explore all sorts of topics. After all, if we had only highly distinct clusters, it might be fruitful to disentangle them into separate conferences.
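If you want the plot itself to hint at topics instead of eyeballing clusters, one option is to color the points by a k-means clustering of the full 384D embeddings. The number of clusters below is an arbitrary choice of mine; experiment with it:
from sklearn.cluster import KMeans
import numpy as np

# Cluster in the original embedding space, then color the 2D projection.
kmeans = KMeans(n_clusters=20, random_state=0, n_init=10).fit(np.stack(embeddings))
df["cluster"] = kmeans.labels_.astype(str)  # string labels -> discrete colors

fig = px.scatter(
    df, x=projections[:, 0], y=projections[:, 1],
    color="cluster", hover_name="title", hover_data=["authors"],
    width=1000, height=1000
)
fig.show()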
And now it is your turn!
Explore what happens if you switch to a different embedding model, only use the title or the abstract, use a different dimensionality reduction technique, or try an entirely different data set! Let me know in the comments what you have built!
The Colab notebook is available here: https://colab.research.google.com/drive/1CVIKRr2oKIV-mVW1tnvtYIMXrFTpZEiP?usp=sharing
Thank you for reading this article! If you enjoyed it please consider subscribing to my updates. If you have any questions feel free to leave them in the comments.