Web Scraping 101

Python
GESIS
computational social science
data science
data collection
web scraping
apis
tutorial
Module 2.1, Introduction to Computational Social Science (Python), GESIS Fall Seminar 2024
Author
Affiliation

University of Waterloo

Published

August 26, 2024

Modified

August 28, 2024

Introduction

In this tutorial, you’ll learn some foundational web scraping skills using Python, focusing specifically on working with static web pages. To help you develop your skills, we’ll work through an extended example of extracting data from the World Happiness Report Wikipedia page. We’ll start with the basics – loading and parsing HTML – and gradually move on to more complex tasks like handling nested tables and creating visualizations from the data we collect. By the end of this tutorial, you’ll know how to navigate the structure of an HTML page, extract specific pieces of data such as headers, body text, and tables, and clean the data for analysis.

Learning Objectives

In this tutorial, you will learn how to:

  • Load a static website using requests to obtain the HTML content of a webpage.
  • Parse HTML using BeautifulSoup to structure and navigate the content.
  • Extract headers and body text from a webpage for further analysis.
  • Extract and process links from the webpage, including differentiating between relative, full, and internal links.
  • Create a DataFrame from extracted data, making it easier to analyze and visualize.
  • Handle and clean HTML tables, including dealing with nested tables that may complicate data extraction.
  • Visualize data by creating simple plots from the cleaned and processed information.

Scraping The Happiness Report Wikipedia Page

We’ll use the World Happiness Report Wikipedia page to demonstrate some simple web scraping techniques. We’ll use an earlier version of the page rather than the version that is currently live.

Setup

As always, we’ll start by importing packages, including

  • requests for making HTTP requests,
  • urllib for processing URL data, and
  • BeautifulSoup from the bs4 package for parsing HTML.
from collections import Counter
import re
import requests
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from io import StringIO
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

from icsspy import set_style

set_style()

Loading Static Websites with requests

The first step in scraping a website is to inspect its source code in your browser and load its content in a way Python can access. We can do the latter using the requests library. The get() function sends a request to the specified URL and returns a response object, which contains all the information sent back from the server, including the HTML source code.

url = "https://en.wikipedia.org/w/index.php?title=World_Happiness_Report&oldid=1241093905"

response = requests.get(url)
response
<Response [200]>

If you print response.text, you’ll see the raw HTML of the webpage. This is the data we’ll be working with in the rest of this tutorial.

Parsing HTML with BeautifulSoup

Now that we have the HTML content, we need to parse it into a structured format. BeautifulSoup helps us do this by converting the HTML string into a navigable tree structure. This allows us to easily search for and extract specific elements, such as headers, paragraphs, and links.

Extracting Text with BeautifulSoup

First, we create a BeautifulSoup object by passing the HTML content to the BeautifulSoup constructor along with the desired parser. We’ll use html.parser, which is built into Python’s standard library.

html_content = response.text
soup = BeautifulSoup(html_content, 'html.parser')
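Before we start extracting anything from the Wikipedia page, it can help to see how this tree structure works at a small scale. Here’s a minimal sketch using a made-up HTML snippet (the snippet, tag names, and class are invented for illustration):

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document (not the Wikipedia page)
snippet = """
<html>
  <head><title>Demo Page</title></head>
  <body>
    <h1>Main Heading</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

demo = BeautifulSoup(snippet, 'html.parser')

print(demo.title.get_text())                      # Demo Page
print(demo.find('h1').get_text())                 # Main Heading
print(demo.find('p', class_='intro').get_text())  # First paragraph.
print(len(demo.find_all('p')))                    # 2
```

Note that find() returns the first matching element, while find_all() returns every match; we’ll rely on both below.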

Extracting Headers

To extract headers, we can search the soup object for all header tags (h1 to h6). Below, we loop through the results and extract the text content of each header.1

1 By default, the .get_text() method returns header text exactly as it appears in the HTML, including any surrounding whitespace. If we want to remove that whitespace, we can set the .get_text() method’s strip argument to True, which strips the edges of each piece of text it extracts.

headers = []

for header in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
    headers.append(header.get_text())

headers
['Contents',
 'World Happiness Report',
 'History',
 'Methods and philosophy',
 'WELLBYs',
 'Happiness of the young and old',
 'Annual report topics',
 '2024 World Happiness Report',
 '2023 World Happiness Report',
 '2022 World Happiness Report',
 '2021 World Happiness Report',
 '2020 World Happiness Report',
 '2019 World Happiness Report',
 '2018 World Happiness Report',
 '2017 World Happiness Report',
 '2016 World Happiness Report',
 '2015 World Happiness Report',
 '2013 World Happiness Report',
 '2012 World Happiness Report',
 'International rankings',
 '2024 report',
 '2023 report',
 '2022 report',
 '2020 report',
 '2019 report',
 '2018 report',
 '2017 report',
 '2016 report',
 '2013 report',
 'Criticism',
 'Metrics',
 'Methodology',
 'Legitimacy',
 'See also',
 'Notes',
 'References',
 'External links']

In my experience, this argument often ends up removing meaningful whitespace, which can cause headaches downstream when doing things like text analysis. For that reason, I prefer not to use it and simply do additional cleaning if/when it is needed.
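To see the difference, here’s a quick sketch using a made-up header fragment with extra whitespace around the text:

```python
from bs4 import BeautifulSoup

# A made-up header with whitespace around the text
fragment = BeautifulSoup('<h2>\n  Annual report topics \n</h2>', 'html.parser')
header = fragment.find('h2')

print(repr(header.get_text()))            # '\n  Annual report topics \n'
print(repr(header.get_text(strip=True)))  # 'Annual report topics'
```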

This code collects all headers into a list.

Extracting Body Text

Similarly, we can extract all paragraphs (<p> tags) from the HTML document. We’ll collect the text from each paragraph into a list.

body_text = []
for paragraph in soup.find_all(['p']):
    body_text.append(paragraph.get_text())

print(f'Found {len(body_text)} paragraphs')
Found 88 paragraphs

There are 88 paragraphs in this document. Let’s preview the first 5.

body_text[:5]
['This is an old revision of this page, as edited by 41.189.206.7 (talk) at 08:42, 19 August 2024 (→\u200e2023 report). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.',
 'The World Happiness Report is a publication that contains articles and rankings of national happiness, based on respondent ratings of their own lives,[1] which the report also correlates with various (quality of) life factors.[2] The report primarily uses data from the Gallup World Poll. As of March 2024, Finland has been ranked the happiest country in the world seven times in a row.[3][4][5][6][7]\n',
 'Since 2024, the report has been published under a partnership between Gallup, the Wellbeing Research Centre at the University of Oxford, and the UN Sustainable Development Solutions Network.[8] The editorial team includes three founding editors, John F. Helliwell, Richard Layard, and Jeffrey D. Sachs, and editors, Jan-Emmanuel De Neve, Lara Aknin, and Shun Wang.[9]\n',
 'In July 2011, the UN General Assembly adopted resolution 65/309 Happiness: Towards a Holistic Definition of Development[10] inviting member countries to measure the happiness of their people and to use the data to help guide public policy. \n',
 'The first World Happiness Report was released on 1 April 2012, as a foundational text for the UN High Level Meeting: Well-being and Happiness: Defining a New Economic Paradigm,[11] drawing international attention.[12] On 2 April 2012, this was followed by the first UN High Level Meeting called Wellbeing and Happiness: Defining a New Economic Paradigm,[13] which was chaired by UN Secretary General Ban Ki-moon and Prime Minister Jigmi Thinley of Bhutan, a nation that adopted gross national happiness instead of gross domestic product as their main development indicator.[14]\n']

Extracting text like this is useful for text analysis, which we’ll cover in the next module. For now, we’ll create a dataframe from this data

ordered_text = pd.DataFrame(body_text)
ordered_text.columns = ['paragraph']
ordered_text['sequence'] = ordered_text.index.to_list()

ordered_text.head(10)
paragraph sequence
0 This is an old revision of this page, as edite... 0
1 The World Happiness Report is a publication th... 1
2 Since 2024, the report has been published unde... 2
3 In July 2011, the UN General Assembly adopted ... 3
4 The first World Happiness Report was released ... 4
5 The first report outlined the state of world h... 5
6 The rankings of national happiness are based o... 6
7 The life factor variables used in the reports ... 7
8 The use of subjective measurements of wellbein... 8
9 In the reports, experts in fields including ec... 9

and write it to disk for later use.

ordered_text.to_csv(
  'output/happiness_report_wikipedia_paragraphs.csv', index=False
)

Extracting Tables

Next, let’s learn how to extract and process table data from a static website.

Finding and Parsing Tables with BeautifulSoup and Pandas

We’ll start by locating all tables within the HTML document. As you might expect, BeautifulSoup allows us to search for all table tags in the HTML.2

2 In some cases, it may be easier to skip BeautifulSoup entirely when working with tables and instead pass the HTML directly to Pandas. This approach is particularly useful when you’re confident that the page contains well-structured tables. If you run tables = pd.read_html(StringIO(response.text)), Pandas will return a list of dataframes corresponding to each table on the page.
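As a quick illustration of that shortcut, read_html can pull a well-structured table straight out of an HTML string (the table below is made up for illustration):

```python
import pandas as pd
from io import StringIO

# A small, made-up, well-structured HTML table
html = """
<table>
  <tr><th>Country</th><th>Score</th></tr>
  <tr><td>Finland</td><td>7.8</td></tr>
  <tr><td>Denmark</td><td>7.6</td></tr>
</table>
"""

demo_tables = pd.read_html(StringIO(html))
print(len(demo_tables))  # 1
print(demo_tables[0])
```

The <th> row is automatically used as the column headers, so demo_tables[0] arrives as a clean dataframe with columns Country and Score.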

tables = soup.find_all('table')
print(f'Found {len(tables)} table(s)')
Found 34 table(s)

At this point, we’ve identified how many tables are present on the page – 34 – but we aren’t interested in all of them. Let’s take a look at what we have and make a decision about how to proceed.

for index, table in enumerate(tables):
    headers = []
    for th in table.find_all('th'):
        headers.append(th.get_text(strip=True))
    print(f"Table {index+1} Column Names: {headers}")
Table 1 Column Names: ['Country rankings for the under-30s', 'Overall rank', 'Country or region', 'Life evaluation']
Table 2 Column Names: ['Overall rank', 'Country or region', 'Life evaluation']
Table 3 Column Names: ['Descriptions']
Table 4 Column Names: ['Descriptions']
Table 5 Column Names: ['Descriptions']
Table 6 Column Names: ['Descriptions']
Table 7 Column Names: ['Descriptions']
Table 8 Column Names: ['Descriptions']
Table 9 Column Names: ['Descriptions']
Table 10 Column Names: ['Descriptions']
Table 11 Column Names: ['Descriptions']
Table 12 Column Names: ['Descriptions']
Table 13 Column Names: ['Descriptions']
Table 14 Column Names: ['Descriptions']
Table 15 Column Names: ['Table', 'Overall rank', 'Country or region', 'Life evaluation']
Table 16 Column Names: ['Overall rank', 'Country or region', 'Life evaluation']
Table 17 Column Names: ['Table', 'Overall rank', 'Country or region']
Table 18 Column Names: ['Overall rank', 'Country or region']
Table 19 Column Names: ['Table', 'Overall rank', 'Country or region']
Table 20 Column Names: ['Overall rank', 'Country or region']
Table 21 Column Names: ['Table', 'Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
Table 22 Column Names: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
Table 23 Column Names: ['Table', 'Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
Table 24 Column Names: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
Table 25 Column Names: ['Table', 'Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
Table 26 Column Names: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
Table 27 Column Names: ['Table', 'Overall Rank', 'Change in rank', 'Country', 'Score', 'Change in score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust', 'Residual']
Table 28 Column Names: ['Overall Rank', 'Change in rank', 'Country', 'Score', 'Change in score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust', 'Residual']
Table 29 Column Names: ['Table', 'Overall Rank[61][62]', 'Country', 'Score', 'Change OverPrior Year', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust']
Table 30 Column Names: []
Table 31 Column Names: ['Overall Rank[61][62]', 'Country', 'Score', 'Change OverPrior Year', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust']
Table 32 Column Names: ['Table', 'Rank[64]', 'Country', 'Happiness']
Table 33 Column Names: ['Rank[64]', 'Country', 'Happiness']
Table 34 Column Names: ['vteLists of countriesbyquality of liferankings', 'General', 'Economic', 'Environment', 'Health', 'Social/political']

This loop extracts the headers from each table, which can help us identify the tables we want.3 Suppose we’re looking for a table that contains information related to “Freedom to make life choices” or other specific indicators. We can easily find tables that contain the relevant column.

3 You may notice that I’m using the strip argument for .get_text() this time. I do find it useful in situations like these, where I want to see the column names (which rarely contain spaces) without any additional characters (such as \n, etc.).

tables_filtered = [
    table 
    for table in tables 
    if "Freedom to make life choices" in table.get_text(strip=True)
]

len(tables_filtered)
11

Now that we’ve filtered the relevant tables, we can convert them into a format that’s easier to work with, such as a Pandas DataFrame.

dfs = pd.read_html(StringIO(str(tables_filtered)))
len(dfs)
17

Pandas’ read_html() function is quite powerful and can automatically extract tables from HTML content. However, sometimes it might pick up more than we expect. In this case, it found an additional 6 tables in the same HTML string!

Let’s explore the extracted dataframes to see what we’ve got.

for df in dfs:
    print(list(df.columns))
['Table', 'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8']
['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
['Table', 'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8']
['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
['Table', 'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8']
['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Unnamed: 8']
['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Unnamed: 8']
['Table', 'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10', 'Unnamed: 11']
['Overall Rank', 'Change in rank', 'Country', 'Score', 'Change in score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust', 'Residual']
['Overall Rank', 'Change in rank', 'Country', 'Score', 'Change in score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust', 'Residual']
['Table', 'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9']
[0, 1]
['Overall Rank [61][62]', 'Country', 'Score', 'Change Over Prior Year', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust']
[0, 1]
['Overall Rank [61][62]', 'Country', 'Score', 'Change Over Prior Year', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust']

Cleaning Up Extracted Tables

We can see that some of our dataframes have columns that are Unnamed. This generally indicates that the table contains multiple header rows; we’ll need to inspect the dataframes to see what’s going on. The first dataframe illustrates the problem:

example_df = dfs[0]
example_df.head()
Table Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8
0 Overall rank Country or region Score GDP per c... NaN NaN NaN NaN NaN NaN NaN NaN
1 Overall rank Country or region Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
2 1 Finland 7.809 1.285 1.500 0.961 0.662 0.160 0.478
3 2 Denmark 7.646 1.327 1.503 0.979 0.665 0.243 0.495
4 3 Switzerland 7.560 1.391 1.472 1.041 0.629 0.269 0.408

If we just had one table, such as the one we’re looking at here, we could easily clean it up by setting the second row as the header and dropping the irrelevant rows.

example_df.columns = example_df.iloc[1]  
example_df.drop([0, 1], inplace=True)  
example_df = example_df.reset_index()
example_df.head()
1 index Overall rank Country or region Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
0 2 1 Finland 7.809 1.285 1.500 0.961 0.662 0.160 0.478
1 3 2 Denmark 7.646 1.327 1.503 0.979 0.665 0.243 0.495
2 4 3 Switzerland 7.560 1.391 1.472 1.041 0.629 0.269 0.408
3 5 4 Iceland 7.504 1.327 1.548 1.001 0.662 0.362 0.145
4 6 5 Norway 7.488 1.424 1.495 1.008 0.670 0.288 0.434

This looks much better, but… we aren’t ready to do this for all dataframes just yet. As part of diagnosing the problem, we can spend a bit more time inspecting the page’s source code using our web browser’s developer tools. This reveals the underlying source of the problem: these tables are nested! The tables we want are nested inside tables we don’t care about. How do we get them out?
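To see why nesting inflates our table counts, here’s a minimal sketch with a made-up nested table: read_html returns one dataframe for the outer table and another for the inner one.

```python
import pandas as pd
from io import StringIO

# A made-up outer table with a second table nested inside one of its cells
nested_html = """
<table>
  <tr><th>Table</th></tr>
  <tr><td>
    <table>
      <tr><th>Overall rank</th><th>Country or region</th></tr>
      <tr><td>1</td><td>Finland</td></tr>
    </table>
  </td></tr>
</table>
"""

nested_dfs = pd.read_html(StringIO(nested_html))
print(len(nested_dfs))  # 2: the outer table and the inner table
```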

To address this problem, we can update our approach: first identify the outer tables, then check each one for a nested inner table and, if we find one, extract it instead. Let’s check the logic of this approach before implementing it.

# Find all the outer tables
outer_tables = soup.find_all('table')

# Loop through all outer tables
for outer_table in outer_tables:
    # Try to find a nested table inside the outer table
    inner_table = outer_table.find('table')

    if inner_table:
        # Convert the inner table to a DataFrame
        table_df = pd.read_html(StringIO(str(inner_table)))[0]
        print(table_df.head(3), '\n')
   Overall rank Country or region  Life evaluation
0             1         Lithuania            7.795
1             2            Israel            7.667
2             3           Iceland            7.658 

   Overall rank Country or region  Life evaluation
0             1           Finland            7.741
1             2           Denmark            7.583
2             3           Iceland            7.525 

   Overall rank Country or region
0             1             India
1             2           Denmark
2             3           Iceland 

   Overall rank Country or region
0             1           Finland
1             2           Denmark
2             3           Iceland 

   Overall rank Country or region  Score  GDP per capita  Social support  \
0             1           Finland  7.809           1.285           1.500   
1             2           Denmark  7.646           1.327           1.503   
2             3       Switzerland  7.560           1.391           1.472   

   Healthy life expectancy  Freedom to make life choices  Generosity  \
0                    0.961                         0.662       0.160   
1                    0.979                         0.665       0.243   
2                    1.041                         0.629       0.269   

   Perceptions of corruption  
0                      0.478  
1                      0.495  
2                      0.408   

   Overall rank Country or region  Score  GDP per capita  Social support  \
0             1           Finland  7.769           1.340           1.587   
1             2           Denmark  7.600           1.383           1.573   
2             3            Norway  7.554           1.488           1.582   

   Healthy life expectancy  Freedom to make life choices  Generosity  \
0                    0.986                         0.596       0.153   
1                    0.996                         0.592       0.252   
2                    1.028                         0.603       0.271   

   Perceptions of corruption  
0                      0.393  
1                      0.410  
2                      0.341   

   Overall rank Country or region  Score  GDP per capita  Social support  \
0             1           Finland  7.632           1.305           1.592   
1             2            Norway  7.594           1.456           1.582   
2             3           Denmark  7.555           1.351           1.590   

   Freedom to make life choices  Generosity  Perceptions of corruption  \
0                         0.874       0.681                      0.202   
1                         0.861       0.686                      0.286   
2                         0.868       0.683                      0.284   

   Unnamed: 8  
0       0.393  
1       0.340  
2       0.408   

  Overall Rank Change in rank  Country  Score  Change in score  \
0            1              3   Norway  7.537            0.039   
1            2              1  Denmark  7.522            0.004   
2            3            NaN  Iceland  7.504            0.003   

   GDP per capita  Social support  Healthy life expectancy  \
0           1.616           1.534                    0.797   
1           1.482           1.551                    0.793   
2           1.481           1.611                    0.834   

   Freedom to make life choices  Generosity  Trust  Residual  
0                         0.635       0.362  0.316     2.277  
1                         0.626       0.355  0.401     2.314  
2                         0.627       0.476  0.154     2.323   

                                                   0  \
0  Explained by: GDP per capita  Explained by: So...   

                                                   1  
0  Explained by: Freedom to make life choices  Ex...   

   Rank[64]      Country  Happiness
0         1      Denmark      7.693
1         2       Norway      7.655
2         3  Switzerland      7.650 

It appears that finding the inner table gets us the data we want. Before we implement this solution, let’s do one more important bit of data processing: associating tables with the dates of the reports they came from.

Associating Tables with Dates

The main data processing we want to do here is associate each table with the corresponding report year. By inspecting the HTML, we can identify the year each table belongs to by looking at the headers.

Our goal is to extract tables from the webpage and associate each with the publication year of the report the data come from, and we will do that by using information in the section headers preceding the tables we collect. In other words, we want to capture the year from h3 headers and link it to the subsequent table element. We also know that some tables might contain nested tables, and that the data we are interested in is stored in the inner table. We want our code to extract and process these inner tables as needed.

Our code is getting a bit more complex here. If you’re new to Python, it helps to know exactly what we are trying to do and why. Here it is, in brief:

  1. Identify the year. We can do this by looking for h3 headers that contain a year (formatted as YYYY) followed by the word ‘report.’ We’ll use a regular expression to identify the pattern. We’ll keep track of the current year as we move through the document.
  2. Associate tables with years. After identifying a year, we check for the next table element. If a table is found, we associate it with the most recently identified year.
  3. Handle nested tables. We now know that some tables contain nested tables, and that the data we want is in the inner table. If a nested table is found, we’ll keep and process the inner table and discard the outer table.
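Before running the full loop, we can sanity-check the regular expression from step 1 against a few sample headers (the sample strings below are made up to mirror the page’s h3 headings):

```python
import re

# The year pattern from step 1: a four-digit year followed by 'report'
year_pattern = re.compile(r'(\d{4}) report')

# Headers like '2024 report' should match; other headings should not
samples = ['2024 report', '2013 report', 'Criticism']
for sample in samples:
    match = year_pattern.search(sample)
    print(sample, '->', match.group(1) if match else None)
```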
# Initialize variables
tables_with_years = []
year_pattern = re.compile(r'(\d{4}) report')  # Regex pattern to identify years
current_year = None

# Loop through all elements, capturing headings and tables
for element in soup.find_all(['h3', 'table']):
    # Check if the element is an h3 heading containing the year pattern
    if element.name == 'h3':
        header_text = element.get_text(strip=True)
        match = year_pattern.search(header_text)
        if match:
            current_year = match.group(1)  # Track the year

    # If the element is a table, associate it with the current year
    elif element.name == 'table' and current_year:
        # Convert the outer table element to a DataFrame
        outer_html_string = str(element)
        outer_table_df = pd.read_html(StringIO(outer_html_string))[0] 
        
        # Check for a nested table within the outer table
        inner_table = element.find('table')
        if inner_table:
            # Convert the inner table to a DataFrame
            inner_html_string = str(inner_table)
            inner_table_df = pd.read_html(StringIO(inner_html_string))[0]
            inner_table_df['Year'] = current_year
            tables_with_years.append(inner_table_df)
        else:
            # If no nested table, add the outer table DataFrame
            outer_table_df['Year'] = current_year
            tables_with_years.append(outer_table_df)

# Check the number of tables collected
len(tables_with_years)
20

Now that we’ve collected the tables and associated them with their respective years, let’s inspect them to ensure the data was extracted correctly. We’ll loop through the list of tables and check their column names.

for i, table in enumerate(tables_with_years):
    print(f"Table {i} columns:", list(table.columns))
Table 0 columns: ['Overall rank', 'Country or region', 'Life evaluation', 'Year']
Table 1 columns: ['Overall rank', 'Country or region', 'Life evaluation', 'Year']
Table 2 columns: ['Overall rank', 'Country or region', 'Year']
Table 3 columns: ['Overall rank', 'Country or region', 'Year']
Table 4 columns: ['Overall rank', 'Country or region', 'Year']
Table 5 columns: ['Overall rank', 'Country or region', 'Year']
Table 6 columns: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Year']
Table 7 columns: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Year']
Table 8 columns: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Year']
Table 9 columns: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Year']
Table 10 columns: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Unnamed: 8', 'Year']
Table 11 columns: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Unnamed: 8', 'Year']
Table 12 columns: ['Overall Rank', 'Change in rank', 'Country', 'Score', 'Change in score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust', 'Residual', 'Year']
Table 13 columns: ['Overall Rank', 'Change in rank', 'Country', 'Score', 'Change in score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust', 'Residual', 'Year']
Table 14 columns: [0, 1, 'Year']
Table 15 columns: [0, 1, 'Year']
Table 16 columns: ['Overall Rank [61][62]', 'Country', 'Score', 'Change Over Prior Year', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust', 'Year']
Table 17 columns: ['Rank[64]', 'Country', 'Happiness', 'Year']
Table 18 columns: ['Rank[64]', 'Country', 'Happiness', 'Year']
Table 19 columns: ['vteLists of countries by quality of life rankings', 'vteLists of countries by quality of life rankings.1', 'Year']

Most of these tables look good, although there are two tables that look like they may need some additional work (14 and 15), and we seem to have some duplicate tables.

tables_with_years[14].head()
0 1 Year
0 Explained by: GDP per capita Explained by: So... Explained by: Freedom to make life choices Ex... 2016
tables_with_years[15].head()
0 1 Year
0 Explained by: GDP per capita Explained by: So... Explained by: Freedom to make life choices Ex... 2016

We’ll remove tables 14 and 15 from our list for now.

del tables_with_years[15] # higher index first!
del tables_with_years[14]

for i, table in enumerate(tables_with_years):
    print(f"Table {i} columns:", list(table.columns))
Table 0 columns: ['Overall rank', 'Country or region', 'Life evaluation', 'Year']
Table 1 columns: ['Overall rank', 'Country or region', 'Life evaluation', 'Year']
Table 2 columns: ['Overall rank', 'Country or region', 'Year']
Table 3 columns: ['Overall rank', 'Country or region', 'Year']
Table 4 columns: ['Overall rank', 'Country or region', 'Year']
Table 5 columns: ['Overall rank', 'Country or region', 'Year']
Table 6 columns: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Year']
Table 7 columns: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Year']
Table 8 columns: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Year']
Table 9 columns: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Year']
Table 10 columns: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Unnamed: 8', 'Year']
Table 11 columns: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Unnamed: 8', 'Year']
Table 12 columns: ['Overall Rank', 'Change in rank', 'Country', 'Score', 'Change in score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust', 'Residual', 'Year']
Table 13 columns: ['Overall Rank', 'Change in rank', 'Country', 'Score', 'Change in score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust', 'Residual', 'Year']
Table 14 columns: ['Overall Rank [61][62]', 'Country', 'Score', 'Change Over Prior Year', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust', 'Year']
Table 15 columns: ['Rank[64]', 'Country', 'Happiness', 'Year']
Table 16 columns: ['Rank[64]', 'Country', 'Happiness', 'Year']
Table 17 columns: ['vteLists of countries by quality of life rankings', 'vteLists of countries by quality of life rankings.1', 'Year']

What about the potential duplicate tables? Let’s take a look.

tables_with_years[0].head()
Overall rank Country or region Life evaluation Year
0 1 Finland 7.741 2024
1 2 Denmark 7.583 2024
2 3 Iceland 7.525 2024
3 4 Sweden 7.344 2024
4 5 Israel 7.341 2024
tables_with_years[1].head()
Overall rank Country or region Life evaluation Year
0 1 Finland 7.741 2024
1 2 Denmark 7.583 2024
2 3 Iceland 7.525 2024
3 4 Sweden 7.344 2024
4 5 Israel 7.341 2024
tables_with_years[0] == tables_with_years[1]
Overall rank Country or region Life evaluation Year
0 True True True True
1 True True True True
2 True True True True
3 True True True True
4 True True True True
... ... ... ... ...
138 True True True True
139 True True True True
140 True True True True
141 True True True True
142 True True True True

143 rows × 4 columns

It looks like we do have duplicated data. There are a few ways we can deal with this, but the easiest way is to leave things as they are and proceed. Once we’ve got everything in one clean dataframe, we can solve our duplicate data problem by dropping duplicate rows.

Our next steps, then, are as follows:

  • select the final dataframes,
  • align their column names,
  • concatenate them into one master dataframe, and
  • drop duplicate rows.
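The last two steps can be sketched in isolation: pd.concat stacks dataframes vertically, and drop_duplicates removes rows that are identical across all columns. Here is a minimal sketch with hypothetical toy data:

```python
import pandas as pd

# Two toy dataframes that share one fully duplicated row (hypothetical values)
a = pd.DataFrame({"Country or region": ["Finland", "Denmark"], "Year": [2024, 2024]})
b = pd.DataFrame({"Country or region": ["Finland", "Iceland"], "Year": [2024, 2024]})

combined = pd.concat([a, b])
deduped = combined.drop_duplicates()
print(len(combined), len(deduped))  # → 4 3
```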

Let’s select the dataframes with a “Life evaluation” column.4 At first glance, only two of our dataframes appear to contain this column, but that’s due to inconsistent naming: the “Life evaluation” score is simply labelled “Score” in some dataframes. Additionally, some dataframes have a column called “Country or region” while others have “Country.” Let’s align these column names.

4 Gallup’s Life evaluation index is a measure of subjective wellbeing based on how people rate their current and expected future lives. Gallup asks people to pick a number between 0 and 10, where 0 represents their worst possible life and 10 represents their best possible life. Participants say where they feel they are now and where they think they will be in five years. You can learn a bit more about the index from Gallup, or by browsing the World Happiness Reports here.

new_columns = {
  "Country": "Country or region",
  "Score": "Life evaluation"
}

for df in tables_with_years:
    df.rename(columns=new_columns, inplace=True)
for i, table in enumerate(tables_with_years):
    print(f"Table {i} columns:", list(table.columns))
Table 0 columns: ['Overall rank', 'Country or region', 'Life evaluation', 'Year']
Table 1 columns: ['Overall rank', 'Country or region', 'Life evaluation', 'Year']
Table 2 columns: ['Overall rank', 'Country or region', 'Year']
Table 3 columns: ['Overall rank', 'Country or region', 'Year']
Table 4 columns: ['Overall rank', 'Country or region', 'Year']
Table 5 columns: ['Overall rank', 'Country or region', 'Year']
Table 6 columns: ['Overall rank', 'Country or region', 'Life evaluation', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Year']
Table 7 columns: ['Overall rank', 'Country or region', 'Life evaluation', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Year']
Table 8 columns: ['Overall rank', 'Country or region', 'Life evaluation', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Year']
Table 9 columns: ['Overall rank', 'Country or region', 'Life evaluation', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Year']
Table 10 columns: ['Overall rank', 'Country or region', 'Life evaluation', 'GDP per capita', 'Social support', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Unnamed: 8', 'Year']
Table 11 columns: ['Overall rank', 'Country or region', 'Life evaluation', 'GDP per capita', 'Social support', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Unnamed: 8', 'Year']
Table 12 columns: ['Overall Rank', 'Change in rank', 'Country or region', 'Life evaluation', 'Change in score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust', 'Residual', 'Year']
Table 13 columns: ['Overall Rank', 'Change in rank', 'Country or region', 'Life evaluation', 'Change in score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust', 'Residual', 'Year']
Table 14 columns: ['Overall Rank [61][62]', 'Country or region', 'Life evaluation', 'Change Over Prior Year', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Trust', 'Year']
Table 15 columns: ['Rank[64]', 'Country or region', 'Happiness', 'Year']
Table 16 columns: ['Rank[64]', 'Country or region', 'Happiness', 'Year']
Table 17 columns: ['vteLists of countries by quality of life rankings', 'vteLists of countries by quality of life rankings.1', 'Year']
final_dfs = []
for df in tables_with_years:
    if 'Life evaluation' in df.columns:
        final_dfs.append(df[['Country or region', "Year", "Life evaluation"]])

final_df = pd.concat(final_dfs)
final_df.drop_duplicates(inplace=True)
final_df.dropna(inplace=True)
final_df = final_df.reset_index()
final_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 922 entries, 0 to 921
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   index              922 non-null    int64 
 1   Country or region  922 non-null    object
 2   Year               922 non-null    object
 3   Life evaluation    922 non-null    object
dtypes: int64(1), object(3)
memory usage: 28.9+ KB

It looks good, but there’s actually a problem lurking in this dataframe. If you inspect the output from info(), you’ll see that Life evaluation is an object (i.e., string) column, not a float. Worse, if we naively clean it with string methods and convert it to numeric, most of our data gets converted to NaNs! 😲

will_have_nans = final_df.copy()
will_have_nans['Life evaluation'] = will_have_nans['Life evaluation'].str.strip()

# Attempt to extract the numeric part and convert it to a float
will_have_nans['Life evaluation'] = will_have_nans['Life evaluation'].str.extract(r'([\d\.]+)')
will_have_nans['Life evaluation'] = pd.to_numeric(will_have_nans['Life evaluation'], errors='coerce')
will_have_nans.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 922 entries, 0 to 921
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              922 non-null    int64  
 1   Country or region  922 non-null    object 
 2   Year               922 non-null    object 
 3   Life evaluation    157 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 28.9+ KB

What’s going on here?

If you print the unique values, you’ll see that pandas stores some observations as genuine floats, while others are numbers stored as strings (note the quotation marks around them in the output below). The .str methods above return NaN for anything that isn’t already a string, which is why the genuine floats got wiped out. And then there is one observation that we can be sure is causing trouble: ‘5.305[b]’.

print(final_df['Life evaluation'].unique())
[7.741 7.583 7.525 7.344 7.341 7.319 7.302 7.122 7.06 7.057 7.029 6.955
 6.951 6.905 6.9 6.894 6.838 6.822 6.818 6.749 6.743 6.733 6.725 6.719
 6.678 6.611 6.609 6.594 6.561 6.523 6.503 6.491 6.469 6.448 6.442 6.421
 6.411 6.36 6.358 6.346 6.324 6.287 6.284 6.272 6.257 6.234 6.195 6.188
 6.068 6.06 6.058 6.048 6.043 6.03 6.017 5.977 5.976 5.975 5.973 5.968
 5.959 5.942 5.934 5.877 5.866 5.842 5.841 5.823 5.816 5.785 5.784 5.725
 5.714 5.707 5.696 5.695 5.607 5.568 5.463 5.455 5.422 5.369 5.364 5.316
 5.304 5.281 5.221 5.216 5.185 5.166 5.158 5.139 5.106 5.08 5.023 4.975
 4.969 4.923 4.893 4.881 4.879 4.874 4.873 4.832 4.795 4.657 4.556 4.548
 4.505 4.485 4.471 4.47 4.422 4.377 4.372 4.354 4.341 4.289 4.269 4.232
 4.228 4.214 4.186 4.054 3.977 3.898 3.886 3.861 3.781 3.566 3.561 3.502
 3.421 3.383 3.341 3.295 3.245 3.186 2.707 1.721 7.809 7.646 7.56 7.504
 7.488 7.449 7.353 7.3 7.294 7.238 7.232 7.223 7.165 7.129 7.121 7.094
 7.076 6.94 6.911 6.864 6.791 6.773 6.664 6.465 6.455 6.44 6.406 6.401
 6.399 6.387 6.377 6.376 6.363 6.348 6.325 6.305 6.281 6.258 6.228 6.227
 6.215 6.192 6.186 6.163 6.159 6.137 6.124 6.102 6.101 6.022 6.006 6.0
 5.999 5.953 5.95 5.925 5.911 5.89 5.872 5.871 5.797 5.778 5.747 5.693
 5.692 5.689 5.674 5.608 5.556 5.546 5.542 5.54 5.536 5.515 5.51 5.505
 5.489 5.456 5.384 5.353 5.286 5.233 5.198 5.194 5.165 5.16 5.148 5.137
 5.132 5.124 5.119 5.102 5.095 5.085 5.053 5.005 4.981 4.949 4.91 4.889
 4.883 4.848 4.833 4.829 4.814 4.785 4.772 4.769 4.751 4.729 4.724 4.677
 4.673 4.672 4.633 4.624 4.583 4.571 4.561 4.558 4.553 4.432 4.423 4.392
 4.375 4.327 4.311 4.308 4.187 4.166 4.151 3.926 3.775 3.759 3.721 3.653
 3.573 3.538 3.527 3.479 3.476 3.312 3.299 2.817 2.567 7.769 7.6 7.554
 7.494 7.48 7.343 7.307 7.278 7.246 7.228 7.167 7.139 7.09 7.054 7.021
 6.985 6.923 6.892 6.852 6.825 6.726 6.595 6.592 6.446 6.444 6.436 6.375
 6.374 6.354 6.321 6.3 6.293 6.262 6.253 6.223 6.199 6.198 6.182 6.174
 6.149 6.125 6.118 6.105 6.1 6.086 6.07 6.046 6.028 6.011 6.008 5.94 5.895
 5.893 5.888 5.886 5.86 5.809 5.779 5.758 5.743 5.718 5.697 5.653 5.648
 5.631 5.603 5.529 5.525 5.523 5.467 5.432 5.43 5.425 5.386 5.373 5.339
 5.323 5.287 5.285 5.274 5.265 5.261 5.247 5.211 5.208 5.197 5.192 5.191
 5.175 5.082 5.044 5.011 4.996 4.944 4.913 4.906 4.812 4.799 4.796 4.722
 4.719 4.707 4.7 4.696 4.681 4.668 4.639 4.628 4.587 4.559 4.534 4.519
 4.516 4.509 4.49 4.466 4.461 4.456 4.437 4.418 4.39 4.374 4.366 4.36 4.35
 4.332 4.286 4.212 4.189 4.107 4.085 4.015 3.975 3.973 3.933 3.802 3.663
 3.597 3.488 3.462 3.41 3.38 3.334 3.231 3.203 3.083 2.853 7.632 7.594
 7.555 7.495 7.487 7.441 7.328 7.324 7.314 7.272 7.19 7.072 6.977 6.965
 6.927 6.91 6.886 6.814 6.774 6.711 6.627 6.489 6.488 6.476 6.441 6.43
 6.419 6.388 6.382 6.379 6.371 6.343 6.322 6.31 6.26 6.173 6.167 6.141
 6.123 6.096 6.083 6.072 5.956 5.952 5.948 5.945 5.933 5.915 5.891 5.875
 5.835 5.81 5.79 5.762 5.752 5.739 5.681 5.663 5.662 5.64 5.636 5.62 5.566
 5.524 5.504 5.483 5.472 5.41 5.398 5.358 5.347 5.321 5.302 5.295 5.254
 5.246 5.201 5.199 5.161 5.155 5.131 5.129 5.125 5.103 5.093 4.982 4.933
 4.88 4.806 4.758 4.743 4.671 4.631 4.623 4.592 4.586 4.5 4.447 4.441
 4.433 4.424 4.419 4.417 4.41 4.356 4.34 4.321 4.301 4.245 4.19 4.161
 4.141 4.139 4.103 3.999 3.964 3.808 3.795 3.774 3.692 3.632 3.59 3.587
 3.582 3.495 3.408 3.355 3.303 3.254 2.905 '7.537' '7.522' '7.504' '7.494'
 '7.469' '7.377' '7.316' '7.314' '7.284' '7.213' '7.079' '7.006' '6.993'
 '6.977' '6.951' '6.891' '6.863' '6.714' '6.652' '6.648' '6.635' '6.609'
 '6.599' '6.578' '6.572' '6.527' '6.454' '6.452' '6.442' '6.424' '6.422'
 '6.403' '6.375' '6.357' '6.344' '6.168' '6.105' '6.098' '6.087' '6.084'
 '6.080' '6.071' '6.008' '6.003' '5.973' '5.971' '5.964' '5.963' '5.956'
 '5.920' '5.902' '5.872' '5.850' '5.838' '5.825' '5.823' '5.822' '5.819'
 '5.810' '5.758' '5.715' '5.629' '5.621' '5.611' '5.569' '5.525' '5.500'
 '5.493' '5.472' '5.430' '5.395' '5.336' '5.324' '5.311' '5.305[b]'
 '5.293' '5.279' '5.273' '5.269' '5.262' '5.250' '5.237' '5.235' '7.342'
 '5.230' '5.227' '5.225' '5.195' '5.182' '5.181' '5.175' '5.151' '5.074'
 '5.041' '5.011' '5.004' '4.962' '4.955' '4.829' '4.805' '4.775' '4.735'
 '4.714' '4.709' '4.695' '4.692' '4.644' '4.608' '4.574' '4.553' '4.550'
 '4.545' '4.535' '4.514' '4.497' '4.465' '4.460' '4.440' '4.376' '4.315'
 '4.292' '4.291' '4.286' '4.280' '4.190' '4.180' '4.168' '4.139' '4.120'
 '4.096' '4.081' '4.032' '4.028' '3.970' '3.936' '3.875' '3.808' '3.795'
 '3.794' '3.766' '3.657' '3.644' '3.603' '3.593' '3.591' '3.533' '3.507'
 '3.495' '3.471' '3.462' '3.349' '2.905' '2.693' 7.526 7.509 7.501 7.498
 7.413 7.404 7.339 7.334 7.313 7.291 7.267 7.119 7.104 7.087 7.039 6.994
 6.952 6.929 6.907 6.871 6.778 6.739 6.705 6.701 6.65 6.596 6.573 6.545
 6.481 6.478 6.474 6.361 6.355 6.269 6.239 6.218 6.168 6.084 6.078 6.005
 5.992 5.987 5.921 5.919 5.897 5.856 5.822 5.813 5.802 5.771 5.768 5.658
 5.615 5.56 5.538 5.528 5.517 5.488 5.458 5.44 5.401 5.389 5.314 5.303
 5.291 5.279 5.245 5.196 5.177 5.163 5.151 5.145 5.123 5.121 5.061 5.057
 5.045 5.033 4.907 4.876 4.875 4.871 4.813 4.793 4.754 4.655 4.643 4.635
 4.575 4.574 4.513 4.508 4.459 4.415 4.404 4.395 4.362 4.324 4.276 4.272
 4.252 4.236 4.219 4.217 4.201 4.193 4.156 4.121 4.073 4.028 3.974 3.956
 3.916 3.907 3.866 3.856 3.832 3.763 3.739 3.724 3.695 3.666 3.622 3.607
 3.515 3.484 3.36 3.069]

If we handle this situation carefully, we’ll get back the data we expect. To target our problematic observation(s), we’ll temporarily create a list, iterate over it while handling each observation according to its type, and then add the clean values back to our dataframe.

vals = final_df['Life evaluation'].tolist()
len(vals)
922

As we process these values, we’ll check their type. If it’s already a float, we’ll do nothing. If it’s a string, we’ll target the problem we know about and then attempt to convert it to a float. Then we’ll check the outcome.

cleaned = []

for val in vals:  
    if isinstance(val, float):  # Check if the value is already a float
        cleaned.append(val)
    elif isinstance(val, str):  # Check if the value is a string
        cleaned.append(float(val.replace('[b]', '')))  # Remove '[b]' and convert to float
    else:
        print(f"Unexpected type: {type(val)}")  # Handle unexpected types

final_df['Life evaluation'] = pd.Series(cleaned)
final_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 922 entries, 0 to 921
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              922 non-null    int64  
 1   Country or region  922 non-null    object 
 2   Year               922 non-null    object 
 3   Life evaluation    922 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 28.9+ KB

We now have a float column with no missing values! 🔥😎
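As an aside, the same cleanup can be done without an explicit loop. The sketch below is an equivalent vectorized alternative, not the approach we used above: casting everything to string first sidesteps the NaN problem we ran into earlier, because the .str methods then see only strings.

```python
import pandas as pd

# Toy column matching the problem cases above (hypothetical values)
col = pd.Series([7.741, "7.537", "5.305[b]"], dtype=object)

# Cast to string, strip the footnote marker, then convert to float
clean = pd.to_numeric(col.astype(str).str.replace(r"\[b\]", "", regex=True))
print(clean.tolist())  # → [7.741, 7.537, 5.305]
```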

final_df.sort_values('Life evaluation', ascending=False, inplace=True)
final_df.head(30)
index Country or region Year Life evaluation
143 0 Finland 2020 7.809
296 0 Finland 2019 7.769
0 0 Finland 2024 7.741
144 1 Denmark 2020 7.646
452 0 Finland 2018 7.632
297 1 Denmark 2019 7.600
453 1 Norway 2018 7.594
1 1 Denmark 2024 7.583
145 2 Switzerland 2020 7.560
454 2 Denmark 2018 7.555
298 2 Norway 2019 7.554
608 0 Norway 2017 7.537
765 0 Denmark 2016 7.526
2 2 Iceland 2024 7.525
609 1 Denmark 2017 7.522
766 1 Switzerland 2016 7.509
610 2 Iceland 2017 7.504
146 3 Iceland 2020 7.504
767 2 Iceland 2016 7.501
768 3 Norway 2016 7.498
455 3 Iceland 2018 7.495
611 3 Switzerland 2017 7.494
299 3 Iceland 2019 7.494
147 4 Norway 2020 7.488
300 4 Netherlands 2019 7.488
456 4 Switzerland 2018 7.487
301 5 Switzerland 2019 7.480
612 4 Finland 2017 7.469
148 5 Netherlands 2020 7.449
457 5 Netherlands 2018 7.441

Visualization

Before we move on to an example of collecting data from an API, let’s create a visualization of the data we just scraped and processed. We’re going to create a small multiples plot. The code below looks a bit complex, but it’s just calculating the size of the figure needed and then looping over each country to create a plot.

sorted_data = final_df.sort_values(['Country or region', 'Year'])
sorted_data = sorted_data[sorted_data['Country or region'] != "World"]
countries = sorted_data['Country or region'].unique()

# Define the number of rows and columns for the subplots
n_cols = 4  # You can adjust this based on how many plots you want per row
n_rows = len(countries) // n_cols + (len(countries) % n_cols > 0)

# Create the figure and subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, n_rows * 2.5), constrained_layout=True, sharex=True, sharey=True)

# Flatten the axes array for easier indexing
axes = axes.flatten()

# Loop through each country and create a subplot
for i, country in enumerate(countries):
    country_data = sorted_data[sorted_data['Country or region'] == country]
    axes[i].plot(country_data['Year'], country_data['Life evaluation'], marker='o')
    axes[i].set_title(f'\n{country}', fontsize=12)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].set_xlim(final_df['Year'].min(), final_df['Year'].max())
    axes[i].set_ylim(0, final_df['Life evaluation'].max() + 1)
    
    axes[i].set_yticks(
      np.arange(
        int(final_df['Life evaluation'].min()), 
        int(final_df['Life evaluation'].max()) + 1, 
        1
      )
    )  
    axes[i].tick_params(axis='x', labelsize=12)
    axes[i].tick_params(axis='y', labelsize=12)
    axes[i].grid(True)

# Hide any unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.savefig("output/happiness_report_small_multiples.png", dpi=300)

Looking pretty good!
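One caveat: as the earlier info() output showed, Year is still stored as strings (dtype object), so matplotlib treats the x-axis values as categories rather than numbers. If you want a true numeric axis, a quick conversion before plotting does the trick. A sketch with hypothetical values:

```python
import pandas as pd

# Toy Year column as scraped, i.e., strings (hypothetical values)
years = pd.Series(["2016", "2017", "2024"], dtype=object)

years_int = years.astype(int)
print(years_int.min(), years_int.max())  # → 2016 2024
```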

Conclusion

In this tutorial, we learned how to scrape data from a static website by working with the Wikipedia page on the World Happiness Report. We learned how to:

  • Load a static website with requests
  • Parse HTML with BeautifulSoup to extract headers, body text, and links
  • Handle relative and internal links
  • Extract and process HTML tables, including handling nested tables
  • Clean and structure the extracted data, preparing it for analysis
  • Create a small multiples plot using matplotlib and seaborn

We’ve just scratched the surface of what’s possible with web scraping. With the skills you’ve learned here, you’ll be able to collect data from a wide variety of static web pages. If you want to learn more, I recommend consulting Mitchell (2024) and McLevey (2022). In the next tutorial, we’ll learn about collecting data from APIs, which are another powerful tool in your data collection toolkit!

Mitchell, Ryan. 2024. Web Scraping with Python. 3rd ed. O’Reilly.
McLevey, John. 2022. Doing Computational Social Science: A Practical Introduction. Sage.