Web-Scraping the NeurIPS Proceedings
The NeurIPS proceedings website links to all the accepted papers for every year, and each paper's page contains the title, authors, and abstract. Luckily, all these pages are static, so we can scrape them with simple HTTP requests to download all the information we need.
First, let’s grab all the links from the 2021 proceedings (2334 in total). We can find them easily by selecting all <a> elements whose href contains “-Abstract.html”.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://papers.nips.cc/paper/2021")
# Parse the response as HTML
soup = BeautifulSoup(response.text, "html.parser")
# Find all the <a> tags on the page (these contain the links)
links = soup.find_all("a")
abstract_links = []
# Collect the URLs of all the abstract pages
for link in links:
    if "-Abstract.html" in link.get("href", ""):  # Filter the abstract pages
        abstract_links.append("https://papers.nips.cc" + link["href"])
print(f"{len(abstract_links)} abstracts found")
Now that we have all the links, we just need to send an HTTP request for each one and extract the paper metadata we are interested in. I will skip the details of pulling out the individual fields, as it is straightforward, if not the most robust approach (see the comments in the code). To save some time, we execute 16 requests in parallel using the built-in multiprocessing library. You can adjust this number depending on your network connection, but please don’t flood the server with requests. The whole thing should take less than a minute.
from multiprocessing import Pool
from tqdm import tqdm
def parse_paper_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # The following is not a nice or robust way to extract the fields and is
    # prone to break if the website layout changes, but it suffices for a demo
    info = {}
    info["title"] = soup.find_all("h4")[0].text
    info["authors"] = soup.find_all("i")[-1].text
    info["abstract"] = soup.find_all("p")[2].text
    info["url"] = url
    return info
results = []
with Pool(16) as pool:  # Execute requests in parallel to speed things up
    for result in tqdm(pool.imap_unordered(parse_paper_page, abstract_links),
                       total=len(abstract_links)):
        results.append(result)
For convenience, we can create a Pandas DataFrame containing all results.
import pandas as pd
df = pd.DataFrame(results)
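Since re-scraping takes a minute, it can be worth persisting the results to disk so we don’t have to hit the server again. A minimal sketch (the file name papers_2021.json is just an example):
# Quick look at the first few rows
print(df.head())
# Save the scraped metadata for later reuse (file name is arbitrary)
df.to_json("papers_2021.json", orient="records")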