Web-Scraping the NeurIPS Proceedings
The NeurIPS proceedings website links to all the accepted papers for every year, and each paper's page contains the title, authors, and abstract. Luckily, all these pages are static, so we can scrape them with simple HTTP requests to download all the information we need.
First, let’s grab all the links from the 2021 proceedings (2334 in total). We can find them easily by selecting all <a> elements whose href contains “-Abstract.html”.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://papers.nips.cc/paper/2021")
# Parse the response as HTML
soup = BeautifulSoup(response.text, "html.parser")
# Find all the <a> tags on the page (these contain the links)
links = soup.find_all("a")
abstract_links = []
# Collect the URLs of all the abstract pages
for link in links:
    if "-Abstract.html" in link.get("href", ""):  # Filter the abstract pages
        abstract_links.append("https://papers.nips.cc" + link["href"])
print(f"{len(abstract_links)} abstracts found")
Now that we have all the links, we just need to send an HTTP request for each one and extract the paper metadata we are interested in. I will skip the details of pulling out the individual fields, as it is straightforward, if not the most robust approach (see the comments in the code). To save some time, we execute 16 requests in parallel using the built-in multiprocessing library. You can adjust this number depending on your network connection, but please don’t flood the server with requests. The whole thing should take less than a minute.
from multiprocessing import Pool
from tqdm import tqdm
def parse_paper_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # The following is not a nice or robust way to extract the fields and is
    # prone to break if the website layout changes, but it suffices for a demo
    info = {}
    info["title"] = soup.find_all("h4")[0].text
    info["authors"] = soup.find_all("i")[-1].text
    info["abstract"] = soup.find_all("p")[2].text
    info["url"] = url
    return info
results = []
with Pool(16) as pool:  # Execute requests in parallel to speed things up
    for result in tqdm(pool.imap_unordered(parse_paper_page, abstract_links),
                       total=len(abstract_links)):
        results.append(result)
For convenience, we can create a Pandas DataFrame containing all results.
import pandas as pd
df = pd.DataFrame(results)
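Since re-scraping takes a minute, it can be worth persisting the results to disk so we don’t have to hit the server again. A minimal sketch (the file name papers_2021.json is just an example):
# Quick look at the first few rows
print(df.head())
# Save the scraped metadata for later reuse (file name is arbitrary)
df.to_json("papers_2021.json", orient="records")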