In this article I’m going to create a web scraper in Python that will scrape Wikipedia pages.
The scraper will go to a Wikipedia page, scrape the title, and follow a random link to the next Wikipedia page.
I think it will be fun to see what random Wikipedia pages this scraper will visit!
Setting up the scraper
To start, I’m going to create a new Python file called scraper.py:
touch scraper.py
To make the HTTP request, I’m going to use the requests library. You can install it with the following command:
pip install requests
Let’s use the web scraping wiki page as our starting point:
import requests

response = requests.get(
    url="https://en.wikipedia.org/wiki/Web_scraping",
)

print(response.status_code)
When running the scraper, it should display a 200 status code:
python3 scraper.py
200
Alright, so far so good!
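As a quick aside, requests won’t raise an error on its own when a page responds with something like a 404 or 500. A minimal way to guard against that is the raise_for_status() helper (a small sketch, not part of the final scraper):

import requests

response = requests.get(
    url="https://en.wikipedia.org/wiki/Web_scraping",
)

# Raise an exception for 4xx/5xx responses instead of
# silently continuing with an error page
response.raise_for_status()

print(response.status_code)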
Extracting data from the page
Let’s extract the title from the HTML page. To make my life easier, I’m going to use the BeautifulSoup package for this.
pip install beautifulsoup4
When inspecting the Wikipedia page, I see that the heading containing the article title has the firstHeading ID.
Beautiful Soup allows you to find an element by its ID:
title = soup.find(id="firstHeading")
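As an aside, Beautful Soup also supports CSS selectors, so the same element can be grabbed with select_one (a small variation; I’ll stick with find for the rest of this article):

title = soup.select_one("#firstHeading")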
Bringing it all together the program now looks like this:
import requests
from bs4 import BeautifulSoup

response = requests.get(
    url="https://en.wikipedia.org/wiki/Web_scraping",
)

soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find(id="firstHeading")
print(title.string)
And when running this, it shows the title of the Wiki article:
python3 scraper.py
Web scraping
Scraping other links
Now I’m going to dive deep into Wikipedia. I’m going to grab a random link to another Wikipedia article and scrape that page.
import requests
from bs4 import BeautifulSoup
import random

response = requests.get(
    url="https://en.wikipedia.org/wiki/Web_scraping",
)

soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find(id="firstHeading")
print(title.text)

# Get all the links
allLinks = soup.find(id="bodyContent").find_all("a")
random.shuffle(allLinks)
linkToScrape = None

for link in allLinks:
    # We are only interested in other wiki articles
    if "/wiki/" not in link.get('href', ''):
        continue

    # Use this link to scrape
    linkToScrape = link
    break

print(linkToScrape)
As you can see, I use soup.find(id="bodyContent").find_all("a") to find all the anchor tags within the main article.
Since I’m only interested in links to other Wikipedia articles, I make sure the link’s href contains the /wiki/ path.
When running the program now, it displays a link to another Wikipedia article. Nice!
python3 scraper.py
<a href="/wiki/Link_farm" title="Link farm">Link farm</a>
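One caveat: the /wiki/ check also matches special pages such as /wiki/File:... and /wiki/Category:.... If you want to stay on regular articles, a slightly stricter filter could look like this (a sketch with a hand-picked, non-exhaustive list of namespace prefixes; the scraper below doesn’t use it):

# Hand-picked Wikipedia namespace prefixes to skip (not exhaustive)
EXCLUDED_PREFIXES = ("/wiki/File:", "/wiki/Help:", "/wiki/Special:",
                     "/wiki/Template:", "/wiki/Category:", "/wiki/Wikipedia:")

def isArticleLink(href):
    # Keep only internal links that point at regular articles
    return href.startswith("/wiki/") and not href.startswith(EXCLUDED_PREFIXES)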
Creating an endless scraper
Alright, let’s make the scraper actually scrape the new link.
To do this I’m going to move everything into a scrapeWikiArticle function.
import requests
from bs4 import BeautifulSoup
import random

def scrapeWikiArticle(url):
    response = requests.get(
        url=url,
    )

    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find(id="firstHeading")
    print(title.text)

    allLinks = soup.find(id="bodyContent").find_all("a")
    random.shuffle(allLinks)
    linkToScrape = None

    for link in allLinks:
        # We are only interested in other wiki articles
        if "/wiki/" not in link.get('href', ''):
            continue

        # Use this link to scrape
        linkToScrape = link
        break

    # Stop if this page has no suitable links at all
    if linkToScrape is None:
        return

    scrapeWikiArticle("https://en.wikipedia.org" + linkToScrape['href'])

scrapeWikiArticle("https://en.wikipedia.org/wiki/Web_scraping")
The scrapeWikiArticle function will get the wiki article, extract the title, and find a random link.
Then it will call scrapeWikiArticle again with this new link, creating an endless cycle of a scraper that bounces around on Wikipedia. (Strictly speaking, Python’s recursion limit will stop it after about a thousand hops; I’ll show a loop-based alternative after the sample run.)
Let’s run the program and see what we get:
python3 scraper.py
Web scraping
Digital object identifier
ISO 8178
STEP-NC
ISO/IEC 2022
EBCDIC 277
Code page 867
Code page 1021
EBCDIC 423
Code page 950
G
R
Mole (unit)
Gram
Remmius Palaemon
Encyclopædia Britannica Eleventh Edition
Geography
Gender studies
Feminism in Brazil
Awesome, in 18 steps we went from “Web scraping” to “Feminism in Brazil”. Amazing!
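As promised, here’s what a loop-based version could look like, so the scraper isn’t capped by Python’s recursion limit (a sketch; the time.sleep is my own addition to be polite to Wikipedia’s servers):

import requests
from bs4 import BeautifulSoup
import random
import time

def scrapeForever(url):
    # A plain loop instead of recursion, so Python's
    # recursion limit never gets in the way
    while url:
        response = requests.get(url=url)
        soup = BeautifulSoup(response.content, 'html.parser')
        print(soup.find(id="firstHeading").text)

        allLinks = soup.find(id="bodyContent").find_all("a")
        random.shuffle(allLinks)

        # Pick the first shuffled link that points at another wiki article
        url = None
        for link in allLinks:
            if "/wiki/" in link.get('href', ''):
                url = "https://en.wikipedia.org" + link['href']
                break

        # Wait a bit between requests to be polite to Wikipedia
        time.sleep(1)

scrapeForever("https://en.wikipedia.org/wiki/Web_scraping")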
Conclusion
We’ve built a web scraper in Python that scrapes random Wikipedia pages. It bounces around endlessly on Wikipedia by following random links.
This is a fun gimmick and Wikipedia is pretty lenient when it comes to web scraping.
There are also websites that are harder to scrape, such as Amazon or Google. If you want to scrape such a website, you should set up a system with headless Chrome browsers and proxy servers. Or you can use a scraping service that handles all of that for you.
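To give a rough idea, a minimal headless Chrome setup with Selenium could look something like this (a sketch assuming Chrome and the selenium package are installed; Selenium is just one possible choice here):

from selenium import webdriver

# Configure Chrome to run without a visible window
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")

driver = webdriver.Chrome(options=options)
driver.get("https://en.wikipedia.org/wiki/Web_scraping")

# The fully rendered HTML, ready to feed into BeautifulSoup as before
html = driver.page_source
driver.quit()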
But be careful not to abuse websites, and only scrape data that you are allowed to scrape.
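For example, a site’s robots.txt file tells you what it allows, and Python’s standard library can check it for you (a small sketch using urllib.robotparser):

from urllib import robotparser

# Download and parse Wikipedia's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

# Check whether a generic crawler may fetch this page
print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Web_scraping"))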
Happy coding!