Extracting Body Text
Similarly, we can extract all paragraphs (<p> tags) from the HTML document. We’ll collect the text from each paragraph into a list.
body_text = []
for paragraph in soup.find_all(['p']):
    body_text.append(paragraph.get_text())
print(f'Found {len(body_text)} paragraphs')
Found 88 paragraphs
There are 88 paragraphs in this document. Let’s preview the first 5.
body_text[:5]
['This is an old revision of this page, as edited by 41.189.206.7 (talk) at 08:42, 19 August 2024 (→\u200e2023 report). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.',
'The World Happiness Report is a publication that contains articles and rankings of national happiness, based on respondent ratings of their own lives,[1] which the report also correlates with various (quality of) life factors.[2] The report primarily uses data from the Gallup World Poll. As of March 2024, Finland has been ranked the happiest country in the world seven times in a row.[3][4][5][6][7]\n',
'Since 2024, the report has been published under a partnership between Gallup, the Wellbeing Research Centre at the University of Oxford, and the UN Sustainable Development Solutions Network.[8] The editorial team includes three founding editors, John F. Helliwell, Richard Layard, and Jeffrey D. Sachs, and editors, Jan-Emmanuel De Neve, Lara Aknin, and Shun Wang.[9]\n',
'In July 2011, the UN General Assembly adopted resolution 65/309 Happiness: Towards a Holistic Definition of Development[10] inviting member countries to measure the happiness of their people and to use the data to help guide public policy. \n',
'The first World Happiness Report was released on 1 April 2012, as a foundational text for the UN High Level Meeting: Well-being and Happiness: Defining a New Economic Paradigm,[11] drawing international attention.[12] On 2 April 2012, this was followed by the first UN High Level Meeting called Wellbeing and Happiness: Defining a New Economic Paradigm,[13] which was chaired by UN Secretary General Ban Ki-moon and Prime Minister Jigmi Thinley of Bhutan, a nation that adopted gross national happiness instead of gross domestic product as their main development indicator.[14]\n']
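The preview shows that the raw paragraphs still carry trailing newlines, bracketed citation markers such as [1], and an invisible left-to-right mark (\u200e). If these get in the way of later analysis, one option is a small cleaning step. The sketch below is only an illustration: `clean_paragraph` is a hypothetical helper that is not part of this tutorial, and it assumes that every bracketed number is a citation marker.

```python
import re


# Hypothetical helper for illustration; not part of the original tutorial.
def clean_paragraph(text: str) -> str:
    """Remove bracketed numeric citation markers and tidy whitespace."""
    text = re.sub(r'\[\d+\]', '', text)   # drop markers such as [1] or [12]
    text = text.replace('\u200e', '')     # drop the left-to-right mark
    # Collapse runs of whitespace (including newlines) and trim the ends.
    return re.sub(r'\s+', ' ', text).strip()


sample = ('As of March 2024, Finland has been ranked the happiest country '
          'in the world seven times in a row.[3][4][5][6][7]\n')
print(clean_paragraph(sample))
```

Applied to the whole list, this would look like `[clean_paragraph(p) for p in body_text]`. Whether to strip the markers depends on the analysis, since they also record which claims are sourced.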
Extracting text like this is useful for text analysis, which we’ll cover in the next module. For now, we’ll create a dataframe from this data
ordered_text = pd.DataFrame(body_text)
ordered_text.columns = ['paragraph']
ordered_text['sequence'] = ordered_text.index.to_list()
ordered_text.head(10)
|   | paragraph | sequence |
|---|---|---|
| 0 | This is an old revision of this page, as edite... | 0 |
| 1 | The World Happiness Report is a publication th... | 1 |
| 2 | Since 2024, the report has been published unde... | 2 |
| 3 | In July 2011, the UN General Assembly adopted ... | 3 |
| 4 | The first World Happiness Report was released ... | 4 |
| 5 | The first report outlined the state of world h... | 5 |
| 6 | The rankings of national happiness are based o... | 6 |
| 7 | The life factor variables used in the reports ... | 7 |
| 8 | The use of subjective measurements of wellbein... | 8 |
| 9 | In the reports, experts in fields including ec... | 9 |
and write it to disk for later use.
ordered_text.to_csv('output/happiness_report_wikipedia_paragraphs.csv', index=False)
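Note that `to_csv` does not create missing folders, so the write fails if the `output` directory does not exist yet. The sketch below shows one way to guard against that and to check the result by reading the file back. It is a minimal illustration under stated assumptions: a two-row stand-in dataframe replaces the real `ordered_text`, and a temporary directory stands in for the project folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in data (assumption); in the tutorial this is the real ordered_text.
ordered_text = pd.DataFrame({'paragraph': ['First paragraph.', 'Second paragraph.']})
ordered_text['sequence'] = ordered_text.index.to_list()

# A temporary directory stands in for the project folder (assumption).
base_dir = Path(tempfile.mkdtemp())
output_dir = base_dir / 'output'
output_dir.mkdir(parents=True, exist_ok=True)  # to_csv will not create this folder

csv_path = output_dir / 'happiness_report_wikipedia_paragraphs.csv'
ordered_text.to_csv(csv_path, index=False)

# Read the file back to confirm the rows and columns survived the round trip.
reloaded = pd.read_csv(csv_path)
print(reloaded.shape)
```

Because `index=False` drops the dataframe index from the file, the explicit `sequence` column is what preserves the original paragraph order.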