APIs, Scraping, and Parsing

SOCI 3040 – Quantitative Research Methods

Class No.: 15

Class Date: March 4, 2025

  Reading Assignment

  • Required: Ch 8
  • Recommended: Ch 8

  Class / Lab Notes

1 API Example: openalexR

Use the “Polite Pool” by giving OpenAlex your email address (and your API key, if you have one)!
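As far as I know, OpenAlex puts requests in the polite pool when they include an email address. Here is a minimal sketch of setting one for openalexR. It assumes the package reads the `openalexR.mailto` option when it builds a request, and the address is a placeholder to replace with your own.

```r
# Assumption: openalexR picks up this option when it builds each request.
# The address below is a placeholder; use your own email.
options(openalexR.mailto = "you@example.com")

# Check that the option is set for this R session
getOption("openalexR.mailto")
```

Options set this way only last for the current session, so this line would need to run again each time R restarts.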

Our goal will be to retrieve publications by researchers at Memorial University who have authored more than 10 publications (indexed in OpenAlex) using the OpenAlex API, and to begin working with the data objects returned from the API.

Note that some of the code in this notebook will take some time to run, especially when downloading data. Be patient, and change the query parameters if you must.

We’ll load the openalexR and tidyverse packages. Don’t forget to install openalexR in your Posit Cloud project if you haven’t already done so!

library(openalexR)
Thank you for using openalexR!
To acknowledge our work, please cite the package by calling `citation("openalexR")`.
To suppress this message, add `openalexR.message = suppressed` to your .Renviron file.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

1.1 Finding Memorial’s OpenAlex Institutional ID

To start, we need to find Memorial’s “Institutional ID”. We can do this by using the OpenAlex website.

OpenAlex search

Click on the Institution button and start typing “Memorial University” into the OpenAlex search.

Search for Memorial University under “Institution”

Take a moment to review the results page – it’s the same page we saw when searching authors and topics in previous classes, only this time it’s for all of Memorial University. As of March 2025, there are almost 50,000 indexed publications.

The results page

Click on “Memorial University” in the search field.

Finding an Institutional ID

You should see an Institution ID under “Memorial University of Newfoundland.” I’ve highlighted it below to make it easier to see. This is the number we want to use when we query the OpenAlex API!

Copying the Institutional ID for an API query

You can click to view MUN’s “Institutional Profile” on OpenAlex.

An OpenAlex institution profile page

1.2 Query the OpenAlex API

Now we know that Memorial’s Institutional ID is i130438778. We can use this to set up an API query. Let’s start by creating a list with our query parameters.

Recall that a list in R is like a container that holds several pieces of information. In this case, our list will contain three key-value pairs:

  • entity = "authors" tells the API that we are interested in data about authors.
  • last_known_institutions.id = "i130438778" specifies that we want authors whose last known (most recent) affiliation is a particular institution (identified by “i130438778”).
  • works_count = ">10" means we want authors who have more than 10 published works.

Let’s set it up!

my_arguments <- list(
    entity = "authors",
    last_known_institutions.id = "i130438778",
    works_count = ">10"
)
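If lists still feel a bit abstract, here is a small sketch of how to look inside one. It rebuilds the same list so that it runs on its own, and it only uses base R.

```r
# The same query parameters as above, stored as a named list
my_arguments <- list(
    entity = "authors",
    last_known_institutions.id = "i130438778",
    works_count = ">10"
)

names(my_arguments)        # the three keys
my_arguments$works_count   # pull out one value by name
my_arguments[["entity"]]   # the same idea using [[ ]]
length(my_arguments)       # how many key-value pairs we have
```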

Now let’s make an API call!

do.call(oa_fetch, c(my_arguments, list(count_only = TRUE)))
     count db_response_time_ms page per_page
[1,]  2307                 169    1        1

do.call() is a function that calls another function – in this case oa_fetch – using a list of arguments.

Here, we combine our my_arguments list with an extra argument, list(count_only = TRUE). This tells the function oa_fetch to only return a count (the number of matching records) instead of all the detailed information OpenAlex has about our search results. Essentially, we’re asking the API: “How many authors match these criteria?”
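To see the do.call() pattern without touching the API, here is a sketch that uses the base R function seq() as a stand-in for oa_fetch. The names base_args, direct, and via_do_call are made up for this illustration.

```r
# A list of arguments, like my_arguments
base_args <- list(from = 1, to = 10)

# Calling the function directly...
direct <- seq(from = 1, to = 10, by = 3)

# ...is the same as handing do.call() the function and a list of arguments.
# c() glues the extra argument onto the original list, just like count_only = TRUE.
via_do_call <- do.call(seq, c(base_args, list(by = 3)))

identical(direct, via_do_call)
```

The payoff is that the arguments live in one reusable object, so we can add or swap a single argument without retyping the whole call.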

So now we’ve defined the search criteria for authors in a list, and we’ve checked whether any authors match the criteria by asking for just a count. If there are results (i.e., there are one or more matching authors in the OpenAlex database), then we can make another request to collect detailed information for those authors.

Run the code block below.

if (do.call(oa_fetch, c(my_arguments, list(count_only = TRUE)))[1] > 0) {
    do.call(oa_fetch, my_arguments) |>
        show_authors() |>
        knitr::kable()
}
Warning: Unknown or uninitialised column: `name`.
Warning: Unknown or uninitialised column: `display_name`.
Warning: Unknown or uninitialised column: `name`.
Warning: Unknown or uninitialised column: `display_name`.
Warning: Unknown or uninitialised column: `name`.
Warning: Unknown or uninitialised column: `display_name`.
| id | display_name | orcid | works_count | cited_by_count | affiliation_display_name | top_concepts |
|:---|:---|:---|---:|---:|:---|:---|
| A5026509039 | Fereidoon Shahidi | 0000-0002-9912-0330 | 1113 | 79559 | Memorial University of Newfoundland | Biochemistry, Animal Science and Zoology, Molecular Biology |
| A5077149719 | Octavia A. Dobre | 0000-0001-8528-0512 | 949 | 20777 | Memorial University of Newfoundland | Electrical and Electronic Engineering, Electrical and Electronic Engineering, Electrical and Electronic Engineering |
| A5010428924 | David C. Schneider | 0000-0003-4771-2155 | 798 | 6532 | Memorial University of Newfoundland | Global and Planetary Change, Insect Science, Ecology, Evolution, Behavior and Systematics |
| A5049089848 | Trung Q. Duong | 0000-0002-4703-4836 | 726 | 18629 | Memorial University of Newfoundland | Electrical and Electronic Engineering, Computer Networks and Communications, Electrical and Electronic Engineering |
| A5032069177 | Ian Fleming | 0000-0002-5541-824X | 701 | 23299 | Memorial University of Newfoundland | Organic Chemistry, Organic Chemistry, Nature and Landscape Conservation |
| A5102982237 | Robert A. Brown | 0000-0003-2350-110X | 528 | 18534 | Memorial University of Newfoundland | Materials Chemistry, Astronomy and Astrophysics, Astronomy and Astrophysics |

In the code above, the if statement checks whether the count returned by the API is greater than zero. The expression do.call(oa_fetch, c(my_arguments, list(count_only = TRUE)))[1] retrieves the first element of the result, which is the number of matching authors. If that number is greater than 0 (meaning at least one author meets the criteria), then the code inside the if block ({ ... }) will run.

Inside the if block, do.call(oa_fetch, my_arguments) calls the oa_fetch function again with our original arguments, only this time we don’t use count_only = TRUE. This tells the API to fetch all the details of the matching authors. We then use the pipe operator |> to pass the results of that function on to the next function, show_authors(). The show_authors() function formats or selects relevant author information. Finally, we pass the formatted data to knitr::kable(), which converts it into a nicely formatted table.

1.3 Store the Data!

So far our code fetches data and processes it for display, but it doesn’t retain the data in memory or write it to disk. If we want to keep it in memory and analyze that data in some way, we need to assign the result to a variable. Recall from previous classes that we want to minimize our API calls to be considerate of the servers providing our data.

Let’s do that now, making yet another API call.

authors_data <- do.call(oa_fetch, my_arguments)
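The assignment above keeps the data in memory for this session only. If we also want a copy on disk, so that re-running the notebook doesn’t hit the API again, one option is a small caching helper. This is just a sketch: fetch_or_load() is a hypothetical helper (not part of openalexR), and fake_fetch() is a stand-in for the real API call so the example runs offline.

```r
# Hypothetical helper: only run `fetch_fun` if there is no saved copy at `path`
fetch_or_load <- function(path, fetch_fun) {
    if (file.exists(path)) {
        return(readRDS(path))   # reuse the saved copy, no API call
    }
    result <- fetch_fun()       # first time: actually fetch
    saveRDS(result, path)       # write it to disk for next time
    result
}

# Stand-in for an API call; it counts how many times it gets called
n_calls <- 0
fake_fetch <- function() {
    n_calls <<- n_calls + 1
    data.frame(display_name = c("Author A", "Author B"), works_count = c(12L, 40L))
}

cache_file <- tempfile(fileext = ".rds")
first <- fetch_or_load(cache_file, fake_fetch)    # calls fake_fetch()
second <- fetch_or_load(cache_file, fake_fetch)   # reads the file instead
n_calls
```

With the real data, the call might look like fetch_or_load("authors_data.rds", function() do.call(oa_fetch, my_arguments)), where the file name is whatever you choose.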

Now our data is stored in authors_data. Let’s take a look at the column names.

names(authors_data)
 [1] "id"                        "display_name"             
 [3] "display_name_alternatives" "ids"                      
 [5] "orcid"                     "works_count"              
 [7] "cited_by_count"            "counts_by_year"           
 [9] "affiliation_display_name"  "affiliation_id"           
[11] "affiliation_ror"           "affiliation_country_code" 
[13] "affiliation_type"          "affiliation_lineage"      
[15] "affiliations_other"        "topics"                   
[17] "works_api_url"            

There are 17 variables for us to work with here! We’ll focus on a few today, including display_name, cited_by_count, works_count, counts_by_year, and topics.

We can print a preview of the tibble like any other. Let’s print 30 rows:

print(authors_data, n = 30)
# A tibble: 2,312 × 17
   id                display_name display_name_alterna…¹ ids   orcid works_count
   <chr>             <chr>        <list>                 <lis> <chr>       <int>
 1 https://openalex… Fereidoon S… <chr [3]>              <chr> http…        1113
 2 https://openalex… Octavia A. … <chr [7]>              <chr> http…         949
 3 https://openalex… David C. Sc… <chr [6]>              <chr> http…         798
 4 https://openalex… Trung Q. Du… <chr [5]>              <chr> http…         726
 5 https://openalex… Ian Fleming  <chr [8]>              <chr> http…         701
 6 https://openalex… Robert A. B… <chr [6]>              <chr> http…         528
 7 https://openalex… Weimin Huang <chr [10]>             <chr> http…         525
 8 https://openalex… Laurence K.… <chr [6]>              <chr> <NA>          523
 9 https://openalex… G.F. Naterer <chr [4]>              <chr> http…         523
10 https://openalex… Proton Rahm… <chr [5]>              <chr> http…         492
11 https://openalex… David G. Be… <chr [6]>              <chr> http…         485
12 https://openalex… David Molyn… <chr [6]>              <chr> http…         478
13 https://openalex… Lynn H. Ger… <chr [13]>             <chr> http…         454
14 https://openalex… M. P. Searle <chr [9]>              <chr> http…         434
15 https://openalex… Baiyu Zhang  <chr [9]>              <chr> http…         430
16 https://openalex… Steven M. R… <chr [8]>              <chr> http…         424
17 https://openalex… Rosemary Ri… <chr [5]>              <chr> http…         412
18 https://openalex… E. Jacobsen  <chr [4]>              <chr> <NA>          394
19 https://openalex… Lev Tarasov  <chr [4]>              <chr> http…         390
20 https://openalex… Sohrab Zend… <chr [3]>              <chr> http…         376
21 https://openalex… Patrick S. … <chr [5]>              <chr> http…         348
22 https://openalex… Neil Bose    <chr [7]>              <chr> http…         344
23 https://openalex… Jian‐Bin Lin <chr [4]>              <chr> http…         336
24 https://openalex… John T. Bro… <chr [5]>              <chr> http…         332
25 https://openalex… Jie Xiao     <chr [5]>              <chr> http…         318
26 https://openalex… Yuming Zhao  <chr [4]>              <chr> http…         317
27 https://openalex… M. Tariq Iq… <chr [15]>             <chr> http…         315
28 https://openalex… Michael Lei… <chr [3]>              <chr> http…         308
29 https://openalex… Abir U. Iga… <chr [6]>              <chr> http…         300
30 https://openalex… Rachel Berk… <chr [5]>              <chr> http…         299
# ℹ 2,282 more rows
# ℹ abbreviated name: ¹​display_name_alternatives
# ℹ 11 more variables: cited_by_count <int>, counts_by_year <list>,
#   affiliation_display_name <chr>, affiliation_id <chr>,
#   affiliation_ror <chr>, affiliation_country_code <chr>,
#   affiliation_type <chr>, affiliation_lineage <chr>,
#   affiliations_other <list>, topics <list>, works_api_url <chr>

Let’s take a look at the topics data first. We’ll get a sense of what is in here and think about how to filter it to a smaller set of results that interest us.

One way to proceed is to use the pull() function from the tidyverse. If you run the code below, you’ll see a LOT of text populate your screen – R is printing one tibble per author, more than 2,300 in all! I won’t print the results here, but you can!

authors_data %>% pull(topics)

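When we pipe authors_data into pull(topics), we get the contents of the topics column on its own, outside of the tibble. For an ordinary column, that would be a simple vector, which is useful for lots of things, including quick computations, plotting, or applying vectorized functions. Because topics is a list-column, what we get back here is a list with one tibble per author. Either way, there’s no surrounding table structure like column names or row indices.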

That’s not always what we want. Instead, we could use the select() function from the tidyverse to get back a tibble containing the column we want. Because it’s a tibble, it preserves additional information such as column names, types, and row names (implicitly). The tibble still has the structure of a table, so we can see the column name and work with it in the context of other columns! And keeping our data in a tibble format makes it easier to perform further data manipulations, or join with other tibbles, since many tidyverse functions expect data to be in a tibble format.

authors_data %>% select(topics)
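To see the difference between pull() and select() on something small, here is a sketch using a made-up tibble with a nested topics column. The toy_authors object and its values are invented for illustration.

```r
library(dplyr)
library(tibble)

# A tiny stand-in for authors_data: two authors, each with a nested topics tibble
toy_authors <- tibble(
    display_name = c("Author A", "Author B"),
    topics = list(
        tibble(display_name = c("Topic X", "Topic Y"), count = c(5L, 2L)),
        tibble(display_name = "Topic Z", count = 9L)
    )
)

pulled <- toy_authors %>% pull(topics)       # a plain list holding two tibbles
selected <- toy_authors %>% select(topics)   # still a tibble, with one list-column

class(pulled)
class(selected)
dim(selected)
```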

This is a complex data structure! Each author in our dataset has a tibble stored in the topics column! Nested dataframes! Oh my.

authors_data %>%
    select(topics) %>%
    .[[1]] %>%
    head(10) %>%
    print()
[[1]]
# A tibble: 100 × 5
       i count name     id                                  display_name        
   <int> <int> <chr>    <chr>                               <chr>               
 1     1   312 topic    https://openalex.org/T10035         Phytochemicals and …
 2     1   312 subfield https://openalex.org/subfields/2704 Biochemistry        
 3     1   312 field    https://openalex.org/fields/27      Medicine            
 4     1   312 domain   https://openalex.org/domains/4      Health Sciences     
 5     2   228 topic    https://openalex.org/T10333         Meat and Animal Pro…
 6     2   228 subfield https://openalex.org/subfields/1103 Animal Science and …
 7     2   228 field    https://openalex.org/fields/11      Agricultural and Bi…
 8     2   228 domain   https://openalex.org/domains/1      Life Sciences       
 9     3   165 topic    https://openalex.org/T11561         Protein Hydrolysis …
10     3   165 subfield https://openalex.org/subfields/1312 Molecular Biology   
# ℹ 90 more rows

[[2]]
# A tibble: 100 × 5
       i count name     id                                  display_name        
   <int> <int> <chr>    <chr>                               <chr>               
 1     1   303 topic    https://openalex.org/T11458         Advanced Wireless C…
 2     1   303 subfield https://openalex.org/subfields/2208 Electrical and Elec…
 3     1   303 field    https://openalex.org/fields/22      Engineering         
 4     1   303 domain   https://openalex.org/domains/3      Physical Sciences   
 5     2   186 topic    https://openalex.org/T10148         Advanced MIMO Syste…
 6     2   186 subfield https://openalex.org/subfields/2208 Electrical and Elec…
 7     2   186 field    https://openalex.org/fields/22      Engineering         
 8     2   186 domain   https://openalex.org/domains/3      Physical Sciences   
 9     3   116 topic    https://openalex.org/T10851         Optical Wireless Co…
10     3   116 subfield https://openalex.org/subfields/2208 Electrical and Elec…
# ℹ 90 more rows

[[3]]
# A tibble: 100 × 5
       i count name     id                                  display_name        
   <int> <int> <chr>    <chr>                               <chr>               
 1     1    57 topic    https://openalex.org/T10230         Marine and fisherie…
 2     1    57 subfield https://openalex.org/subfields/2306 Global and Planetar…
 3     1    57 field    https://openalex.org/fields/23      Environmental Scien…
 4     1    57 domain   https://openalex.org/domains/3      Physical Sciences   
 5     2    53 topic    https://openalex.org/T10135         Insect-Plant Intera…
 6     2    53 subfield https://openalex.org/subfields/1109 Insect Science      
 7     2    53 field    https://openalex.org/fields/11      Agricultural and Bi…
 8     2    53 domain   https://openalex.org/domains/1      Life Sciences       
 9     3    35 topic    https://openalex.org/T12329         Hemiptera Insect St…
10     3    35 subfield https://openalex.org/subfields/1105 Ecology, Evolution,…
# ℹ 90 more rows

[[4]]
# A tibble: 100 × 5
       i count name     id                                  display_name        
   <int> <int> <chr>    <chr>                               <chr>               
 1     1   265 topic    https://openalex.org/T10148         Advanced MIMO Syste…
 2     1   265 subfield https://openalex.org/subfields/2208 Electrical and Elec…
 3     1   265 field    https://openalex.org/fields/22      Engineering         
 4     1   265 domain   https://openalex.org/domains/3      Physical Sciences   
 5     2   261 topic    https://openalex.org/T10796         Cooperative Communi…
 6     2   261 subfield https://openalex.org/subfields/1705 Computer Networks a…
 7     2   261 field    https://openalex.org/fields/17      Computer Science    
 8     2   261 domain   https://openalex.org/domains/3      Physical Sciences   
 9     3   161 topic    https://openalex.org/T11458         Advanced Wireless C…
10     3   161 subfield https://openalex.org/subfields/2208 Electrical and Elec…
# ℹ 90 more rows

[[5]]
# A tibble: 100 × 5
       i count name     id                                  display_name        
   <int> <int> <chr>    <chr>                               <chr>               
 1     1   210 topic    https://openalex.org/T10013         Asymmetric Synthesi…
 2     1   210 subfield https://openalex.org/subfields/1605 Organic Chemistry   
 3     1   210 field    https://openalex.org/fields/16      Chemistry           
 4     1   210 domain   https://openalex.org/domains/3      Physical Sciences   
 5     2   165 topic    https://openalex.org/T11549         Synthetic Organic C…
 6     2   165 subfield https://openalex.org/subfields/1605 Organic Chemistry   
 7     2   165 field    https://openalex.org/fields/16      Chemistry           
 8     2   165 domain   https://openalex.org/domains/3      Physical Sciences   
 9     3   133 topic    https://openalex.org/T10302         Fish Ecology and Ma…
10     3   133 subfield https://openalex.org/subfields/2309 Nature and Landscap…
# ℹ 90 more rows

[[6]]
# A tibble: 100 × 5
       i count name     id                                  display_name        
   <int> <int> <chr>    <chr>                               <chr>               
 1     1    86 topic    https://openalex.org/T11087         Solidification and …
 2     1    86 subfield https://openalex.org/subfields/2505 Materials Chemistry 
 3     1    86 field    https://openalex.org/fields/25      Materials Science   
 4     1    86 domain   https://openalex.org/domains/3      Physical Sciences   
 5     2    69 topic    https://openalex.org/T10039         Stellar, planetary,…
 6     2    69 subfield https://openalex.org/subfields/3103 Astronomy and Astro…
 7     2    69 field    https://openalex.org/fields/31      Physics and Astrono…
 8     2    69 domain   https://openalex.org/domains/3      Physical Sciences   
 9     3    68 topic    https://openalex.org/T10325         Astro and Planetary…
10     3    68 subfield https://openalex.org/subfields/3103 Astronomy and Astro…
# ℹ 90 more rows

[[7]]
# A tibble: 100 × 5
       i count name     id                                  display_name        
   <int> <int> <chr>    <chr>                               <chr>               
 1     1   188 topic    https://openalex.org/T11061         Ocean Waves and Rem…
 2     1   188 subfield https://openalex.org/subfields/1910 Oceanography        
 3     1   188 field    https://openalex.org/fields/19      Earth and Planetary…
 4     1   188 domain   https://openalex.org/domains/3      Physical Sciences   
 5     2   107 topic    https://openalex.org/T10891         Radar Systems and S…
 6     2   107 subfield https://openalex.org/subfields/2202 Aerospace Engineeri…
 7     2   107 field    https://openalex.org/fields/22      Engineering         
 8     2   107 domain   https://openalex.org/domains/3      Physical Sciences   
 9     3    93 topic    https://openalex.org/T10255         Oceanographic and A…
10     3    93 subfield https://openalex.org/subfields/1910 Oceanography        
# ℹ 90 more rows

[[8]]
# A tibble: 100 × 5
       i count name     id                                  display_name        
   <int> <int> <chr>    <chr>                               <chr>               
 1     1   242 topic    https://openalex.org/T10612         Magnetism in coordi…
 2     1   242 subfield https://openalex.org/subfields/2504 Electronic, Optical…
 3     1   242 field    https://openalex.org/fields/25      Materials Science   
 4     1   242 domain   https://openalex.org/domains/3      Physical Sciences   
 5     2   216 topic    https://openalex.org/T11881         Crystallization and…
 6     2   216 subfield https://openalex.org/subfields/2505 Materials Chemistry 
 7     2   216 field    https://openalex.org/fields/25      Materials Science   
 8     2   216 domain   https://openalex.org/domains/3      Physical Sciences   
 9     3   215 topic    https://openalex.org/T12613         X-ray Diffraction i…
10     3   215 subfield https://openalex.org/subfields/2505 Materials Chemistry 
# ℹ 90 more rows

[[9]]
# A tibble: 100 × 5
       i count name     id                                  display_name        
   <int> <int> <chr>    <chr>                               <chr>               
 1     1   120 topic    https://openalex.org/T11802         Chemical Looping an…
 2     1   120 subfield https://openalex.org/subfields/2204 Biomedical Engineer…
 3     1   120 field    https://openalex.org/fields/22      Engineering         
 4     1   120 domain   https://openalex.org/domains/3      Physical Sciences   
 5     2    78 topic    https://openalex.org/T10998         Heat Transfer and O…
 6     2    78 subfield https://openalex.org/subfields/2210 Mechanical Engineer…
 7     2    78 field    https://openalex.org/fields/22      Engineering         
 8     2    78 domain   https://openalex.org/domains/3      Physical Sciences   
 9     3    54 topic    https://openalex.org/T12696         Icing and De-icing …
10     3    54 subfield https://openalex.org/subfields/2202 Aerospace Engineeri…
# ℹ 90 more rows

[[10]]
# A tibble: 100 × 5
       i count name     id                                  display_name        
   <int> <int> <chr>    <chr>                               <chr>               
 1     1   250 topic    https://openalex.org/T11092         Spondyloarthritis S…
 2     1   250 subfield https://openalex.org/subfields/2745 Rheumatology        
 3     1   250 field    https://openalex.org/fields/27      Medicine            
 4     1   250 domain   https://openalex.org/domains/4      Health Sciences     
 5     2   211 topic    https://openalex.org/T10469         Psoriasis: Treatmen…
 6     2   211 subfield https://openalex.org/subfields/2403 Immunology          
 7     2   211 field    https://openalex.org/fields/24      Immunology and Micr…
 8     2   211 domain   https://openalex.org/domains/1      Life Sciences       
 9     3   178 topic    https://openalex.org/T10200         Rheumatoid Arthriti…
10     3   178 subfield https://openalex.org/subfields/2745 Rheumatology        
# ℹ 90 more rows

If we print some rows from the topics tibble, we can see that it contains information on topic, subfield, field, and domain. We’ll focus on topics for now and will come back to fields later.

authors_data %>%
    select(topics) %>%
    unnest(topics) %>%
    distinct(display_name) %>%
    print(n = 30)
# A tibble: 4,286 × 1
   display_name                                
   <chr>                                       
 1 Phytochemicals and Antioxidant Activities   
 2 Biochemistry                                
 3 Medicine                                    
 4 Health Sciences                             
 5 Meat and Animal Product Quality             
 6 Animal Science and Zoology                  
 7 Agricultural and Biological Sciences        
 8 Life Sciences                               
 9 Protein Hydrolysis and Bioactive Peptides   
10 Molecular Biology                           
11 Biochemistry, Genetics and Molecular Biology
12 Antioxidant Activity and Oxidative Stress   
13 Fatty Acid Research and Health              
14 Nutrition and Dietetics                     
15 Nursing                                     
16 Edible Oils Quality and Analysis            
17 Organic Chemistry                           
18 Chemistry                                   
19 Physical Sciences                           
20 Aquaculture Nutrition and Growth            
21 Aquatic Science                             
22 Free Radicals and Antioxidants              
23 Advanced Chemical Sensor Technologies       
24 Biomedical Engineering                      
25 Engineering                                 
26 Tea Polyphenols and Effects                 
27 Pathology and Forensic Medicine             
28 Nuts composition and effects                
29 Phytoestrogen effects and research          
30 Food Chemistry and Fat Analysis             
# ℹ 4,256 more rows
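Notice that the list above mixes topics with subfields, fields, and domains, because all four levels share the display_name column. If we only want the topic level, one option is to filter on the name column before calling distinct(). Here is a minimal sketch on made-up data (toy_authors and all of its values are invented), so it runs without an API call:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Two pretend authors whose nested tibbles have the same `name` levels as the real data
toy_authors <- tibble(
    author = c("Author A", "Author B"),
    topics = list(
        tibble(
            name = c("topic", "subfield", "field", "domain"),
            display_name = c("Topic X", "Subfield X", "Field X", "Domain X")
        ),
        tibble(
            name = c("topic", "field"),
            display_name = c("Topic Y", "Field X")
        )
    )
)

topics_only <- toy_authors %>%
    select(topics) %>%
    unnest(topics) %>%
    filter(name == "topic") %>%   # drop subfield, field, and domain rows
    distinct(display_name)

topics_only
```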

Let’s filter the tibble to find authors (rows in authors_data) who have worked on the topic of “migration.”

Let’s unpack the code below.

First, we create a variable called search_topic and assign it the string “migration”. We will use this value later to filter the data. Then we pipe the authors_data into the select() function, where we select the display_name column (containing author names) and the topics column, which contains nested tibbles with information about the topics on which a specific author has published.

We then pipe the author names and topics into the unnest() function, which unnests the topics column. In other words, we pull each topics tibble out of its row in the dataframe and expand it so that every nested row becomes its own row in the result. The names_sep = "_" parameter specifies that, when unnesting, any new column names coming from the nested structure should be prefixed with the original column name, joined by an underscore. For example, the nested column named display_name becomes topics_display_name. Why does that matter? Because the nested tibble has columns with the same names as the tibble it’s embedded in, which can cause… chaos.
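Here is a small sketch of what names_sep does, using a made-up one-author tibble (toy_authors and its values are invented). Both the outer tibble and the nested one have a display_name column, which is exactly the clash we want to avoid.

```r
library(dplyr)
library(tidyr)
library(tibble)

# One pretend author with two nested topic rows
toy_authors <- tibble(
    display_name = "Author A",
    topics = list(
        tibble(display_name = c("Topic X", "Topic Y"), count = c(5L, 2L))
    )
)

# names_sep = "_" prefixes every nested column with "topics_"
unnested <- toy_authors %>%
    unnest(topics, names_sep = "_")

names(unnested)
nrow(unnested)   # one row per nested topic row
```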

Next, we pipe the unnested data into the filter() function. We use it to keep only the rows where the topics_display_name column contains the text stored in search_topic (which is “migration”). To make that happen, we use the str_detect() function to check if a string contains a specific pattern. We wrap the pattern in regex() with the argument ignore_case = TRUE to get a “regular expression” that ignores case differences (so “Migration” or “migration” both match).
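To see why ignore_case matters, here is a quick sketch with a handful of strings. The first two are topic names from the output further down, and the last two are made up for the example.

```r
library(stringr)

topic_names <- c(
    "Migration and Labor Dynamics",
    "European Law and Migration",
    "Organic Chemistry",
    "seasonal migration of birds"   # made-up lowercase example
)

# Case-insensitive: matches "Migration" and "migration"
insensitive <- str_detect(topic_names, regex("migration", ignore_case = TRUE))

# Case-sensitive (the default): only the lowercase spelling matches
sensitive <- str_detect(topic_names, "migration")

insensitive
sensitive
```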

Now we can display a list of unique authors who meet our search and filter criteria by piping the results into distinct() and then printing the first n results (in this case, 80).

search_topic <- "migration"

authors_data %>%
    select(display_name, topics) %>%
    unnest(topics, names_sep = "_") %>%
    filter(str_detect(topics_display_name, regex(search_topic, ignore_case = TRUE))) %>%
    distinct(display_name) %>%
    print(n = 80)
# A tibble: 160 × 1
   display_name             
   <chr>                    
 1 Sian Neilson             
 2 Roger White              
 3 Ratana Chuenpagdee       
 4 Tony Fang                
 5 Trevor Bell              
 6 Lewis R. Fischer         
 7 Eric Y. Tenkorang        
 8 Barbara Neis             
 9 R. J. Avery              
10 Yang Zhen                
11 Amin A. Muhammad Gadit   
12 Ashlee Cunsolo           
13 Derek Nurse              
14 Mark C. J. Stoddart      
15 Kelly Vodden             
16 Stephen Bornstein        
17 Lisa Philpott            
18 Marilyn Porter           
19 Shree Mulay              
20 Tyler D. Eddy            
21 Yanqing Yi               
22 Diana L. Gustafson       
23 Alex Stewart             
24 Ben Burt                 
25 A. Lukyn Williams        
26 Sharon R. Roseman        
27 Jennifer A. Selby        
28 Gillian Kolla            
29 Sulaimon Gıwa            
30 Diane Tye                
31 Victor Maddalena         
32 Christopher P. Youé      
33 Fern Brunger             
34 KL MacPherson            
35 David Close              
36 Sonja Boon               
37 T.E. Roche               
38 Robert Ormsby            
39 Benjamin Rich Zendel     
40 Jane G. Zhu              
41 Adrian Tanner            
42 Lisa Rankin              
43 Alexander Y. Shestopaloff
44 Stephen Czarnuch         
45 Dale Kirby               
46 Martha Traverso-Yépez    
47 María Andrée López Gómez 
48 Nancy Pedri              
49 Rick Audas               
50 Alan Hall                
51 M. J. Anderson           
52 Delores V. Mullings      
53 Rochelle R. Côté         
54 Gordon B. Cooke          
55 Roselyne N. Okech        
56 Bill Bigelow             
57 Isabelle Côté            
58 Gerald Mugford           
59 Katherine Side           
60 Nicholas Wells           
61 Jaro Stacul              
62 Maisam Najafizada        
63 Susan Stuckless          
64 Peter Narváez            
65 Jean L. Briggs           
66 Pauline Duke             
67 Sarah Gander             
68 Ahmed Afzal              
69 Yorck Sommerhäuser       
70 Natalie Beausoleil       
71 Rainer Baehre            
72 Valerie Burton           
73 Frederick Johnstone      
74 Yolande Pottie‐Sherman   
75 Nathalie LaCoste         
76 Hollý                    
77 Hugh Whalen              
78 Lisa‐Jo K. van den Scott 
79 Martin Lovelace          
80 Jeanne Sinclair          
# ℹ 80 more rows

What are we looking at here? These are authors affiliated with Memorial University who have at least one publication assigned to a topic with “migration” in its name. The output lists only their names; we’ll add more columns in the next section. We got that data by developing a small pipeline that:

  1. Starts with our original dataset.
  2. Focuses on just the author names and their topics.
  3. Expands nested topic information into individual rows.
  4. Filters rows where the topic matches “migration”.
  5. Removes duplicate author names.
  6. Prints the results, showing up to 80 rows.

Now… change the search_topic above to search for other topics. Try “climate change” (or whatever)!
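If you find yourself swapping topics a lot, one option is to wrap the pipeline in a function. This is only a sketch: find_authors_by_topic() is a hypothetical helper name, and toy_authors is made-up data (including the invented “Climate Change Policy” topic) so the example runs without the API. With the real data you would pass authors_data instead.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Hypothetical helper: the same pipeline as above, with the topic as an argument
find_authors_by_topic <- function(data, pattern) {
    data %>%
        select(display_name, topics) %>%
        unnest(topics, names_sep = "_") %>%
        filter(str_detect(topics_display_name, regex(pattern, ignore_case = TRUE))) %>%
        distinct(display_name)
}

# Made-up stand-in for authors_data
toy_authors <- tibble(
    display_name = c("Author A", "Author B", "Author C"),
    topics = list(
        tibble(display_name = c("Migration and Labor Dynamics", "Climate Change Policy")),
        tibble(display_name = "Climate Change Policy"),
        tibble(display_name = "European Law and Migration")
    )
)

migration_authors <- find_authors_by_topic(toy_authors, "migration")
climate_authors <- find_authors_by_topic(toy_authors, "climate change")

migration_authors
climate_authors
```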

1.4 More than Names

OK, cool! But, once again, maybe we want to see some other information, like the name of the author, the name of each matching topic, and the number of that author’s publications assigned to the topic. Let’s do that below and print the first 200 rows.

search_topic <- "migration"

authors_data %>%
    select(display_name, topics, works_count, cited_by_count) %>%
    unnest(topics, names_sep = "_") %>%
    filter(str_detect(topics_display_name, regex(search_topic, ignore_case = TRUE))) %>%
    select(display_name, topics_display_name, topics_count) %>%
    print(n = 200)
# A tibble: 207 × 3
    display_name                topics_display_name                 topics_count
    <chr>                       <chr>                                      <int>
  1 Sian Neilson                European Law and Migration                     4
  2 Sian Neilson                Human Rights and Immigration                   3
  3 Roger White                 Migration and Labor Dynamics                  31
  4 Roger White                 Migration, Ethnicity, and Economy             25
  5 Ratana Chuenpagdee          Climate Change, Adaptation, Migrat…            5
  6 Tony Fang                   Migration and Labor Dynamics                  30
  7 Tony Fang                   Migration, Ethnicity, and Economy             27
  8 Tony Fang                   Diaspora, migration, transnational…            5
  9 Trevor Bell                 Climate Change, Adaptation, Migrat…            4
 10 Lewis R. Fischer            Migration, Policy, and Dickens Stu…            1
 11 Eric Y. Tenkorang           Migration and Labor Dynamics                   3
 12 Eric Y. Tenkorang           Migration, Health and Trauma                   2
 13 Barbara Neis                Migration, Aging, and Tourism Stud…            9
 14 Barbara Neis                Migration and Labor Dynamics                   4
 15 R. J. Avery                 Migration, Aging, and Tourism Stud…            4
 16 Yang Zhen                   China's Global Influence and Migra…            1
 17 Amin A. Muhammad Gadit      Migration, Health and Trauma                   4
 18 Ashlee Cunsolo              Climate Change, Adaptation, Migrat…            9
 19 Derek Nurse                 Diaspora, migration, transnational…            1
 20 Mark C. J. Stoddart         Climate Change, Adaptation, Migrat…            3
 21 Kelly Vodden                Migration, Aging, and Tourism Stud…            5
 22 Stephen Bornstein           Migration, Health and Trauma                   2
 23 Lisa Philpott               Migration, Ethnicity, and Economy              2
 24 Marilyn Porter              Migration, Ethnicity, and Economy              5
 25 Marilyn Porter              Migration and Labor Dynamics                   3
 26 Shree Mulay                 Migration, Health and Trauma                   3
 27 Tyler D. Eddy               Climate Change, Adaptation, Migrat…            3
 28 Yanqing Yi                  Migration, Health and Trauma                   7
 29 Diana L. Gustafson          Migration, Health and Trauma                   9
 30 Alex Stewart                Migration, Ethnicity, and Economy             14
 31 Ben Burt                    Climate Change, Adaptation, Migrat…            6
 32 Ben Burt                    Italian Social Issues and Migration            1
 33 Ben Burt                    Diaspora, migration, transnational…            1
 34 A. Lukyn Williams           European Law and Migration                     2
 35 Sharon R. Roseman           Migration, Aging, and Tourism Stud…           12
 36 Sharon R. Roseman           Migration and Labor Dynamics                   6
 37 Sharon R. Roseman           Migration, Ethnicity, and Economy              5
 38 Sharon R. Roseman           Diaspora, migration, transnational…            4
 39 Sharon R. Roseman           Immigration and Intercultural Educ…            2
 40 Sharon R. Roseman           Migration, Refugees, and Integrati…            2
 41 Jennifer A. Selby           Multiculturalism, Politics, Migrat…           29
 42 Jennifer A. Selby           Migration, Identity, and Health                3
 43 Gillian Kolla               Migration, Health and Trauma                   2
 44 Sulaimon Gıwa               Migration and Labor Dynamics                   5
 45 Sulaimon Gıwa               Migration, Ethnicity, and Economy              4
 46 Sulaimon Gıwa               Migration, Health and Trauma                   3
 47 Sulaimon Gıwa               Migration, Refugees, and Integrati…            3
 48 Sulaimon Gıwa               Migration, Identity, and Health                2
 49 Diane Tye                   Migration, Ethnicity, and Economy              1
 50 Victor Maddalena            Migration, Health and Trauma                   2
 51 Christopher P. Youé         Migration, Ethnicity, and Economy              1
 52 Fern Brunger                Migration, Health and Trauma                   3
 53 KL MacPherson               China's Global Influence and Migra…            2
 54 David Close                 Diaspora, migration, transnational…            2
 55 Sonja Boon                  Migration, Ethnicity, and Economy              2
 56 T.E. Roche                  Macrophage Migration Inhibitory Fa…            2
 57 Robert Ormsby               Migration, Policy, and Dickens Stu…            1
 58 Benjamin Rich Zendel        Migration, Aging, and Tourism Stud…            1
 59 Jane G. Zhu                 Migration, Ethnicity, and Economy              2
 60 Adrian Tanner               Multiculturalism, Politics, Migrat…            2
 61 Adrian Tanner               Climate Change, Adaptation, Migrat…            1
 62 Lisa Rankin                 Migration, Aging, and Tourism Stud…            1
 63 Alexander Y. Shestopaloff   Migration, Health and Trauma                   2
 64 Stephen Czarnuch            Migration, Health and Trauma                   2
 65 Dale Kirby                  Migration, Ethnicity, and Economy              2
 66 Dale Kirby                  Migration and Labor Dynamics                   2
 67 Martha Traverso-Yépez       Migration, Racism, and Human Rights            2
 68 María Andrée López Gómez    Migration, Aging, and Tourism Stud…            2
 69 Nancy Pedri                 Diaspora, migration, transnational…            1
 70 Rick Audas                  Migration and Labor Dynamics                   2
 71 Alan Hall                   Migration, Identity, and Health                1
 72 M. J. Anderson              Macrophage Migration Inhibitory Fa…            1
 73 Delores V. Mullings         Migration and Labor Dynamics                   6
 74 Delores V. Mullings         Migration, Health and Trauma                   4
 75 Delores V. Mullings         Migration, Ethnicity, and Economy              3
 76 Rochelle R. Côté            Migration, Ethnicity, and Economy              6
 77 Rochelle R. Côté            Migration and Labor Dynamics                   3
 78 Gordon B. Cooke             Migration, Aging, and Tourism Stud…            2
 79 Roselyne N. Okech           Migration, Ethnicity, and Economy              2
 80 Bill Bigelow                Migration, Ethnicity, and Economy              1
 81 Isabelle Côté               Migration and Labor Dynamics                   4
 82 Isabelle Côté               Migration, Refugees, and Integrati…            4
 83 Isabelle Côté               Diaspora, migration, transnational…            2
 84 Isabelle Côté               Migration, Identity, and Health                1
 85 Gerald Mugford              Migration, Aging, and Tourism Stud…            2
 86 Katherine Side              Migration, Refugees, and Integrati…            4
 87 Nicholas Wells              Migration, Aging, and Tourism Stud…            2
 88 Jaro Stacul                 Italian Social Issues and Migration            1
 89 Maisam Najafizada           Migration, Health and Trauma                   2
 90 Susan Stuckless             Migration, Aging, and Tourism Stud…            1
 91 Peter Narváez               Migration, Ethnicity, and Economy              1
 92 Jean L. Briggs              Migration, Education, Indigenous S…            2
 93 Pauline Duke                Migration, Health and Trauma                   2
 94 Sarah Gander                Migration, Health and Trauma                   4
 95 Ahmed Afzal                 Diaspora, migration, transnational…            3
 96 Ahmed Afzal                 Migration and Labor Dynamics                   2
 97 Ahmed Afzal                 Migration, Ethnicity, and Economy              2
 98 Yorck Sommerhäuser          Migration, Ethnicity, and Economy              1
 99 Natalie Beausoleil          Multiculturalism, Politics, Migrat…            1
100 Rainer Baehre               Migration, Ethnicity, and Economy              1
101 Rainer Baehre               Migration and Labor Dynamics                   1
102 Valerie Burton              Migration, Ethnicity, and Economy              3
103 Frederick Johnstone         Migration, Ethnicity, and Economy              3
104 Yolande Pottie‐Sherman      Migration and Labor Dynamics                   7
105 Yolande Pottie‐Sherman      Migration, Ethnicity, and Economy              6
106 Yolande Pottie‐Sherman      Migration, Refugees, and Integrati…            5
107 Yolande Pottie‐Sherman      Diaspora, migration, transnational…            3
108 Yolande Pottie‐Sherman      Migration, Aging, and Tourism Stud…            3
109 Nathalie LaCoste            Multiculturalism, Politics, Migrat…            1
110 Hollý                       Migration and Labor Dynamics                   1
111 Hugh Whalen                 Multiculturalism, Politics, Migrat…            1
112 Lisa‐Jo K. van den Scott    Diaspora, migration, transnational…            1
113 Martin Lovelace             Migration, Ethnicity, and Economy              1
114 Jeanne Sinclair             Migration, Aging, and Tourism Stud…            1
115 Christopher Patey           Italian Social Issues and Migration            1
116 Dominique Brégent‐Heald     Migration, Health, Geopolitics, Hi…            2
117 Daze Jefferies              Migration, Ethnicity, and Economy              1
118 David Peddle                Human Rights and Immigration                   1
119 Stephen Harold Riggins      Diaspora, migration, transnational…            2
120 Stephen Harold Riggins      Migration, Ethnicity, and Economy              1
121 James Valcour               Climate Change, Adaptation, Migrat…            1
122 Mercedes Steedman           Migration, Ethnicity, and Economy              2
123 Ivan Emke                   European Law and Migration                     1
124 Ivan Emke                   Migration, Health and Trauma                   1
125 Robin Whitaker              Migration, Refugees, and Integrati…            4
126 Robin Whitaker              Multiculturalism, Politics, Migrat…            1
127 Lincoln Addison             Migration, Ethnicity, and Economy              3
128 Lincoln Addison             Migration and Labor Dynamics                   1
129 Elizabeth Yeoman            Migration, Refugees, and Integrati…            1
130 Alessandro Giardino         Multiculturalism, Politics, Migrat…            1
131 D. Codner                   Macrophage Migration Inhibitory Fa…            1
132 Lesley Butler               Diaspora, migration, transnational…            1
133 Lesley Butler               Migration, Ethnicity, and Economy              1
134 Lesley Butler               Climate Change, Adaptation, Migrat…            1
135 Julia Temple Newhook        Migration, Ethnicity, and Economy              1
136 Julia Temple Newhook        Migration and Labor Dynamics                   1
137 Donald W. Nichol            Migration, Policy, and Dickens Stu…            1
138 John Mannion                Climate Change, Adaptation, Migrat…            1
139 Santé A. Viselli            Multiculturalism, Politics, Migrat…            1
140 Ronald Schwartz             Diaspora, migration, transnational…            1
141 Brenda A. LeFrançois        Labour Market and Migration                    1
142 Roza Tchoukaleyska          Migration, Identity, and Health                1
143 Heidi Coombs-Thorne         Migration and Labor Dynamics                   1
144 Robert Shea                 Migration, Refugees, and Integrati…            1
145 Paul Alhassan Issahaku      Migration, Health and Trauma                   1
146 Amy M. Warren               Migration, Aging, and Tourism Stud…            1
147 Rowena Mercado              Migration, Health and Trauma                   1
148 Kwamina Abekah‐Carter       Migration, Aging, and Tourism Stud…            2
149 Angela J. Hyde              Macrophage Migration Inhibitory Fa…            1
150 Sylvia Moore                Climate Change, Adaptation, Migrat…            1
151 Christopher Curran          Migration and Labor Dynamics                   4
152 Katie Gillespie             Migration, Health and Trauma                   4
153 Catherine Losier            Migration, Identity, and Health                4
154 Catherine Losier            Migration, Health, Geopolitics, Hi…            1
155 Barry C. Gaulton            Migration, Health, Geopolitics, Hi…            1
156 August Carbonella           Migration, Ethnicity, and Economy              4
157 August Carbonella           Multiculturalism, Politics, Migrat…            1
158 Nicholas Lynch              Migration, Aging, and Tourism Stud…            1
159 Russell Dawe                Migration, Health and Trauma                   1
160 Kodjo Attikpoé              Migration and Exile Studies                    1
161 Lorna Bennett               Migration, Health and Trauma                   1
162 Jennifer L. Buckle          Migration, Health and Trauma                   3
163 Mariya Lesiv                Diaspora, migration, transnational…            2
164 Mariya Lesiv                Migration, Ethnicity, and Economy              1
165 John Bodner                 Multiculturalism, Politics, Migrat…            1
166 Calvin Hollett              Migration, Aging, and Tourism Stud…            1
167 A. K. M. Shahidullah        Climate Change, Adaptation, Migrat…            1
168 Rebecca J. Franklin         Migration, Ethnicity, and Economy              2
169 Beth Leavenworth DuFault    Migration, Ethnicity, and Economy              1
170 Leanna Butters              Migration, Aging, and Tourism Stud…            5
171 Raquel Ruiz‐Díaz            Climate Change, Adaptation, Migrat…            3
172 Jacqueline Hesson           Migration, Health and Trauma                   1
173 Jennifer Thorburn           Diaspora, migration, transnational…            7
174 Jennifer Thorburn           Migration and Labor Dynamics                   7
175 Madonna M. Murphy           Migration, Health and Trauma                   1
176 Roberta Buchanan            Multiculturalism, Politics, Migrat…            1
177 Sean W. D. Gray             Migration, Refugees, and Integrati…            3
178 Cory W. Thorne              Diaspora, migration, transnational…            1
179 Michael Skipton             Migration, Policy, and Dickens Stu…            2
180 Neil J. Vincent             Migration, Health and Trauma                   1
181 Devonne Ryan                Migration, Aging, and Tourism Stud…            1
182 Margo Wilson                Migration, Health and Trauma                   1
183 Michael D. Kirkpatrick      Migration, Ethnicity, and Economy              1
184 Kate Lahey                  Climate Change, Adaptation, Migrat…            1
185 Elias Bartellas             Migration, Health and Trauma                   1
186 Mohamed Salah Eddine Madiou Climate Change, Adaptation, Migrat…            1
187 Elise Thorburn              Migration, Aging, and Tourism Stud…            1
188 Leslie J. Cake              Migration, Aging, and Tourism Stud…            1
189 Hua Que                     Migration, Health and Trauma                   6
190 Hua Que                     Migration and Labor Dynamics                   3
191 Hua Que                     Migration, Refugees, and Integrati…            2
192 Caroline Guinard            Migration, Identity, and Health                1
193 Ban Younghusband            Macrophage Migration Inhibitory Fa…            1
194 Jessica Squires             Macrophage Migration Inhibitory Fa…            1
195 Tyler R. Pritchard          Migration, Health and Trauma                   2
196 Raleen Murphy               Migration, Health and Trauma                   2
197 Darren Hynes                Migration, Health, Geopolitics, Hi…            1
198 Halia Koo                   Multiculturalism, Politics, Migrat…            1
199 Alka Agarwal-Mawal          Macrophage Migration Inhibitory Fa…            1
200 Jieying Xiong               Macrophage Migration Inhibitory Fa…            1
# ℹ 7 more rows
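
To see what unnest() and names_sep are doing in that pipeline, here is a minimal sketch on a toy table. The authors, topics, and counts in toy_authors are invented for illustration, and the nested table is assumed to have only display_name and count columns; the real authors_data has more.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Toy stand-in for authors_data: one row per author, with a nested
# table of topics. All names and counts here are invented.
toy_authors <- tibble(
    display_name = c("Author A", "Author B"),
    topics = list(
        tibble(
            display_name = c("Migration and Labor Dynamics", "Marine Ecology"),
            count = c(3L, 7L)
        ),
        tibble(
            display_name = "Migration, Health and Trauma",
            count = 2L
        )
    )
)

toy_matches <- toy_authors %>%
    # names_sep = "_" prefixes the nested columns with "topics_", so the
    # topic's display_name does not clash with the author's display_name.
    unnest(topics, names_sep = "_") %>%
    # Keep only rows whose topic name contains the search term.
    filter(str_detect(topics_display_name, regex("migration", ignore_case = TRUE))) %>%
    select(display_name, topics_display_name, topics_count)

print(toy_matches)
```

Each author now appears once per matching topic, which is why the same name can show up on several rows in the output above.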

Let’s store this subset of data for later.

migration_authors <- authors_data %>%
    select(display_name, topics, works_count, cited_by_count) %>%
    unnest(topics, names_sep = "_") %>%
    filter(str_detect(topics_display_name, regex(search_topic, ignore_case = TRUE))) %>%
    select(display_name, topics_display_name, topics_count)

1.5 Key Concepts and Common Words

What are some of the key concepts that show up in research on migration conducted by Memorial researchers? One simple way to get at this is to take every word that appears across the matching topic names and count how many times it appears. It’s crude, but a useful first pass to get a sense of what we have.

To do this, we’ll use another package, tidytext, which provides tools for “natural language processing”. We’ll load the library and then start our pipeline by piping the migration_authors data into tidytext’s unnest_tokens() function. unnest_tokens() splits the topics_display_name column into individual words (tokens), lowercasing them and dropping punctuation by default. Each word becomes a separate row in the dataset, and the new column is named word. Then we pipe that output into count(word, sort = TRUE) to count the occurrences of each unique word in the word column. The sort = TRUE argument sorts the results in descending order of frequency.

library(tidytext)

word_counts <- migration_authors %>%
    unnest_tokens(word, topics_display_name) %>%
    count(word, sort = TRUE)

print(word_counts, n = 50)
# A tibble: 48 × 2
   word                 n
   <chr>            <int>
 1 migration          204
 2 and                149
 3 health              44
 4 economy             36
 5 ethnicity           36
 6 trauma              33
 7 studies             26
 8 dynamics            24
 9 identity            23
10 labor               23
11 aging               21
12 tourism             21
13 adaptation          16
14 change              16
15 climate             16
16 diaspora            16
17 transnational       16
18 gender              12
19 multiculturalism    12
20 politics            12
21 integration         10
22 refugees            10
23 factor               9
24 inhibitory           9
25 macrophage           9
26 dickens              4
27 geography            4
28 geopolitics          4
29 historical           4
30 human                4
31 policy               4
32 rights               4
33 social               4
34 european             3
35 immigration          3
36 issues               3
37 italian              3
38 law                  3
39 china's              2
40 education            2
41 global               2
42 influence            2
43 racism               2
44 exile                1
45 indigenous           1
46 intercultural        1
47 labour               1
48 market               1
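
If the tokenizing step feels abstract, this small sketch runs the same two functions on just two topic names (both taken from the results above) so you can follow every word by hand. The object names toy_topics and toy_counts are made up for this example.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two topic names, one per row.
toy_topics <- tibble(
    topics_display_name = c(
        "Migration and Labor Dynamics",
        "Migration, Health and Trauma"
    )
)

toy_counts <- toy_topics %>%
    # One row per word; words are lowercased and the comma is dropped.
    unnest_tokens(word, topics_display_name) %>%
    # Tally each distinct word, most frequent first.
    count(word, sort = TRUE)

print(toy_counts)
```

The eight words in the two names collapse to six distinct rows, with "migration" and "and" each counted twice.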

1.6 Visualizing Word Frequencies

Let’s make a horizontal bar graph (where the words are on the y-axis and word frequency is on the x-axis) to visualize this distribution of words. We’ll also remove “stop words” like “and,” “of,” etc.

data("stop_words")
filtered_word_counts <- word_counts %>%
    anti_join(stop_words, by = "word")
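
Because we searched for “migration”, that word will dominate any count. One option, sketched here as a suggestion rather than a required step, is to extend the stop word list with your own terms before the anti_join(). The names custom_stop_words and toy_counts, and the "custom" lexicon label, are hypothetical; the three counts are copied from the table above.

```r
library(dplyr)
library(tibble)
library(tidytext)

data("stop_words")

# A few rows copied from the word counts above.
toy_counts <- tibble(
    word = c("migration", "and", "health"),
    n = c(204L, 149L, 44L)
)

# stop_words has two columns, word and lexicon, so extra rows
# need the same shape. "custom" is just a label we chose.
custom_stop_words <- bind_rows(
    stop_words,
    tibble(word = "migration", lexicon = "custom")
)

# anti_join() keeps only the rows of toy_counts whose word
# does NOT appear in the stop word table.
toy_filtered <- toy_counts %>%
    anti_join(custom_stop_words, by = "word")

print(toy_filtered)
```

Whether to drop the search term is a judgment call: removing it makes the other words easier to compare, but the plot then no longer shows how often it appeared.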

And now for the bar graph!

ggplot(filtered_word_counts, aes(x = reorder(word, n), y = n)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    labs(
        title = "Word Frequency in Migration Research",
        x = "Words",
        y = "Frequency"
    ) +
    theme_minimal()

Remember you can save your plot with the ggsave function!

ggsave("migration_word_frequency.png", width = 10, height = 8)
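
By default ggsave() saves the last plot that was displayed. If you have made several plots, it can be safer to store the plot in an object and pass it in explicitly. The sketch below does that with three rows from the word counts above; toy_plot and the file name are made up, and the file goes to a temporary folder so nothing in your project is overwritten.

```r
library(ggplot2)

# Three rows copied from the word counts above.
toy_counts <- data.frame(
    word = c("health", "economy", "trauma"),
    n = c(44, 36, 33)
)

# Putting the words on the y-axis gives horizontal bars directly;
# geom_col() is shorthand for geom_bar(stat = "identity").
toy_plot <- ggplot(toy_counts, aes(x = n, y = reorder(word, n))) +
    geom_col() +
    labs(x = "Frequency", y = "Words") +
    theme_minimal()

# Save this specific plot object rather than "whatever was drawn last".
out_file <- file.path(tempdir(), "toy_word_frequency.png")
ggsave(out_file, plot = toy_plot, width = 6, height = 4, dpi = 300)
```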

1.7 Fields

The topics data also include “field” classifications. These are multi-level labels attached to publications. By multi-level I mean that some labels are very high-level (e.g., social sciences, sociology) while others are more focused (e.g., sociology of gender, racial inequality). Let’s make a table counting how often each label appears in the Memorial data.

field_counts <- authors_data %>%
    select(topics) %>%
    unnest(topics) %>%
    count(display_name, sort = TRUE)

print(field_counts, n = 30)
# A tibble: 4,286 × 2
   display_name                                             n
   <chr>                                                <int>
 1 Physical Sciences                                    19166
 2 Social Sciences                                      19166
 3 Health Sciences                                      11998
 4 Medicine                                              9391
 5 Life Sciences                                         7869
 6 Engineering                                           5640
 7 Environmental Science                                 3950
 8 Biochemistry, Genetics and Molecular Biology          3752
 9 Arts and Humanities                                   2521
10 Computer Science                                      2376
11 Earth and Planetary Sciences                          2334
12 Health Professions                                    2119
13 Agricultural and Biological Sciences                  2083
14 Molecular Biology                                     2002
15 Sociology and Political Science                       1961
16 Psychology                                            1939
17 Chemistry                                             1401
18 General Health Professions                            1401
19 Materials Science                                     1192
20 Education                                             1188
21 Neuroscience                                          1166
22 Ecology                                               1149
23 Physics and Astronomy                                 1121
24 Electrical and Electronic Engineering                  943
25 Surgery                                                905
26 Genetics                                               895
27 Public Health, Environmental and Occupational Health   894
28 Economics, Econometrics and Finance                    886
29 Business, Management and Accounting                    832
30 Global and Planetary Change                            776
# ℹ 4,256 more rows

A few things jump out at me from this list of the top 30 labels. First, publication output in the physical and social sciences at Memorial is neck and neck! If you were to combine Health Sciences, Medicine, and Life Sciences, they would top the list. Arts and Humanities is pretty high on this list too, claiming the number 9 rank. Sociology and Political Science is at rank 15.
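
Since the table mixes labels from every level, you may want to count one level at a time. The sketch below assumes the nested topics table has a column recording each label's level, called name here, with values such as "topic", "subfield", "field", and "domain". That column name is an assumption, so run names() on your own unnested data to check before relying on it. The two toy authors and their rows are invented for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for authors_data. The "name" column (the label's level)
# is an ASSUMED column; confirm it exists in your real data.
toy_authors <- tibble(
    display_name = c("Author A", "Author B"),
    topics = list(
        tibble(
            name = c("topic", "subfield", "field", "domain"),
            display_name = c(
                "Migration and Labor Dynamics",
                "Sociology and Political Science",
                "Social Sciences",
                "Social Sciences"
            )
        ),
        tibble(
            name = c("topic", "subfield", "field", "domain"),
            display_name = c(
                "Global and Planetary Change",
                "Ecology",
                "Environmental Science",
                "Physical Sciences"
            )
        )
    )
)

field_only_counts <- toy_authors %>%
    select(topics) %>%
    unnest(topics) %>%
    # Keep a single level before counting.
    filter(name == "field") %>%
    count(display_name, sort = TRUE)

print(field_only_counts)
```

Notice that in the toy data Author A has “Social Sciences” at two different levels. Counting by display_name alone would tally that label twice for one author, which is one reason to filter by level first.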

Let’s ggplot!

top_n_fields <- field_counts %>%
    slice_max(n, n = 30)

ggplot(top_n_fields, aes(x = reorder(display_name, n), y = n)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    labs(
        title = "Top 30 Fields in Memorial University Publications",
        x = "Fields",
        y = "Count"
    ) +
    theme_minimal()

We’ll stop there for today. Tomorrow we’ll work on the second Data Stories assignment in class. It’s due on Monday, March 10th.