Creating Graphs, Tables, & Maps

SOCI 3040 – Quantitative Research Methods

Class No.: 9
Class Date: February 4, 2025

  Reading Assignment

  • Required: Ch 5
  • Recommended: Ch 4

  Class / Lab Notes

Graphs are an essential component of data storytelling. They help us recognize patterns, understand distributions, and communicate findings effectively. This notebook introduces best practices for graphing in R using ggplot2 and demonstrates how summary statistics can sometimes be misleading if we do not visualize our data.

0.1 Learning Objectives

By the end of this session, students will:

  • Understand the importance of graphing raw data before relying on summary statistics.
  • Learn to use ggplot2 for scatterplots, bar charts, line plots, and faceted plots.
  • Explore different themes, color palettes, and aesthetic modifications.
  • Apply these techniques to real-world datasets.

0.2 Setup: Loading Required Packages

# Load necessary libraries
library(tidyverse) # Core tidyverse package for data wrangling & visualization
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(datasauRus) # Fun dataset illustrating the importance of visualization
library(ggplot2) # Graphing package (already attached by tidyverse; loaded explicitly for clarity)
library(janitor) # Cleaning column names

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(WDI) # Accessing World Bank economic indicators
library(carData) # British Election Panel Study dataset
library(patchwork) # Combining multiple plots
library(tidygeocoder) # Geocoding support
library(tinytable) # Nice formatted tables

# Set theme for all plots
theme_set(theme_minimal())
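
The theme_set() call above makes theme_minimal() the default for every plot that follows. As a quick, hedged illustration (using the built-in mtcars data, which is not one of the course datasets), an individual plot can still override that default by adding another theme layer:

# The global default from theme_set() applies automatically
ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point()
# Adding a theme layer overrides the global default for this plot only
ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    theme_classic()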

0.3 Why Graphing Your Data is Important

0.3.1 Example 1: The Datasaurus Dozen

The datasaurus_dozen dataset illustrates why we should always plot our data instead of relying solely on summary statistics.

# Display the dataset
datasaurus_dozen
# A tibble: 1,846 × 3
   dataset     x     y
   <chr>   <dbl> <dbl>
 1 dino     55.4  97.2
 2 dino     51.5  96.0
 3 dino     46.2  94.5
 4 dino     42.8  91.4
 5 dino     40.8  88.3
 6 dino     38.7  84.9
 7 dino     35.6  79.9
 8 dino     33.1  77.6
 9 dino     29.0  74.5
10 dino     26.2  71.4
# ℹ 1,836 more rows

Each subset of this dataset has the same mean and standard deviation for x and y, yet their visual patterns are completely different.

0.3.1.1 Computing Summary Statistics

The code below is a pipeline that processes the datasaurus_dozen dataset to compute and display summary statistics (the mean and standard deviation) for four selected datasets: "dino", "star", "away", and "bullseye". The pipeline begins with filter(dataset %in% c("dino", "star", "away", "bullseye")), which subsets the data to these four datasets. The summarise() function then calculates the mean and standard deviation of both x and y, using across() to apply both functions to each of the two columns. The .by = dataset argument groups the calculation by dataset, so each subset receives its own computed values.

After computing the summary statistics, the code formats the output into a readable table. The tt() function (from the tinytable package) creates a neatly formatted table, and style_tt(j = 2:5, align = "r") right-aligns the numeric columns (x mean, x sd, y mean, y sd) for better readability. The format_tt(digits = 1, num_fmt = "decimal") call displays numeric values with one decimal place. Finally, setNames(c("Dataset", "x mean", "x sd", "y mean", "y sd")) renames the columns with more descriptive labels. The pipeline efficiently extracts, summarizes, and presents the key statistics from datasaurus_dozen while emphasizing the importance of looking beyond summary statistics to understand data distributions visually.

datasaurus_dozen |>
    filter(dataset %in% c("dino", "star", "away", "bullseye")) |>
    summarise(
        across(c(x, y), list(mean = mean, sd = sd)),
        .by = dataset
    ) |>
    tt() |>
    style_tt(j = 2:5, align = "r") |>
    format_tt(digits = 1, num_fmt = "decimal") |>
    setNames(c("Dataset", "x mean", "x sd", "y mean", "y sd"))
Dataset    x mean   x sd   y mean   y sd
dino         54.3   16.8     47.8   26.9
away         54.3   16.8     47.8   26.9
star         54.3   16.8     47.8   26.9
bullseye     54.3   16.8     47.8   26.9

👉 Key Takeaway: These datasets appear identical in summary statistics, but let's plot them.

Recall that the mean, or average, is a measure of central tendency that represents the typical value in a dataset. It is calculated by adding up all the values and dividing by the total number of values. The mean gives us a sense of where most of the data points are centered. The standard deviation measures how spread out the values are from the mean. If the standard deviation is small, most values are close to the mean; if it's large, the values are more spread out.

In short:

  • low standard deviation = data points are close together
  • high standard deviation = data points are more spread out

These concepts help us understand how typical or how varied our data is.
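
To make these definitions concrete, here is a minimal sketch using two made-up vectors (toy values, not course data) that share the same mean but have very different standard deviations:

# Two toy vectors with the same mean but different spread
close_together <- c(9, 10, 11)
spread_out <- c(1, 10, 19)
mean(close_together) # 10
mean(spread_out) # 10
sd(close_together) # 1 (values cluster tightly around the mean)
sd(spread_out) # 9 (values are spread far from the mean)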

0.3.1.2 Visualizing the Datasaurus Dozen

# Plot the datasets
datasaurus_dozen |>
    filter(dataset %in% c("dino", "star", "away", "bullseye")) |>
    ggplot(aes(x = x, y = y, colour = dataset)) +
    geom_point() +
    facet_wrap(vars(dataset), nrow = 2, ncol = 2) +
    labs(color = "Dataset")

👉 Observation: Despite having identical summary statistics, each dataset has a distinct shape!

0.3.2 Example 2: Anscombe's Quartet

Frank Anscombe developed Anscombe's Quartet to highlight the same issue.

head(anscombe)
  x1 x2 x3 x4   y1   y2    y3   y4
1 10 10 10  8 8.04 9.14  7.46 6.58
2  8  8  8  8 6.95 8.14  6.77 5.76
3 13 13 13  8 7.58 8.74 12.74 7.71
4  9  9  9  8 8.81 8.77  7.11 8.84
5 11 11 11  8 8.33 9.26  7.81 8.47
6 14 14 14  8 9.96 8.10  8.84 7.04

This dataset contains four sets of (x, y) values that share identical means, variances, and regression lines.

0.3.2.1 Tidying the Data

We use pivot_longer() to convert it into tidy format.

tidy_anscombe <- anscombe |> pivot_longer(
    everything(),
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
)
tidy_anscombe
# A tibble: 44 × 3
   set       x     y
   <chr> <dbl> <dbl>
 1 1        10  8.04
 2 2        10  9.14
 3 3        10  7.46
 4 4         8  6.58
 5 1         8  6.95
 6 2         8  8.14
 7 3         8  6.77
 8 4         8  5.76
 9 1        13  7.58
10 2        13  8.74
# ℹ 34 more rows

This code reshapes the anscombe dataset into a tidy format using the pivot_longer() function. The tidy format (or tidy data) is a structured way of organizing data where each row represents an observation, each column represents a variable, and each cell contains a single value. This format, introduced by Hadley Wickham, makes data easier to manipulate, visualize, and analyze using tools like ggplot2 and dplyr. For example, in a non-tidy format, you might have separate columns for x1, y1, x2, y2, etc. (like in Anscombe's Quartet). In a tidy format, you would restructure the data so that there are only three columns: set (indicating the dataset), x, and y, with each row representing one observation. Tidy data is particularly useful because it works seamlessly with the tidyverse, allowing for easier grouping, filtering, summarizing, and plotting.

The everything() argument ensures that all columns in the dataset are transformed. The names_to = c(".value", "set") argument tells pivot_longer() to split the original column names into two parts: one representing the variable (x or y) and the other representing the dataset number (1, 2, 3, or 4). The names_pattern = "(.)(.)" argument uses a regular expression to separate column names based on their structure (e.g., x1, y1 → x, y for dataset 1). As a result, the tidy_anscombe dataset now has three columns: set (identifying the dataset number), x (the independent variable), and y (the dependent variable). This transformation makes the data more structured and easier to work with, particularly for grouped analysis and visualization in ggplot2.
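
To see the reshaping in isolation, here is a minimal sketch that builds a hypothetical two-row, wide-format tibble (toy values, not the real anscombe data) and passes it through the same pivot_longer() call:

# A toy wide-format tibble in the style of anscombe
wide_example <- tibble(
    x1 = c(10, 8), y1 = c(8.0, 7.0),
    x2 = c(10, 8), y2 = c(9.1, 8.1)
)
# The same pivot_longer() call as above, applied to the toy data
wide_example |>
    pivot_longer(
        everything(),
        names_to = c(".value", "set"),
        names_pattern = "(.)(.)"
    )
# Result: three columns (set, x, y), with one row per observation per set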

0.3.2.2 Computing Summary Statistics

tidy_anscombe |>
    summarise(
        across(c(x, y), list(mean = mean, sd = sd)),
        .by = set
    ) |>
    tt() |>
    style_tt(j = 2:5, align = "r") |>
    format_tt(digits = 1, num_fmt = "decimal") |>
    setNames(c("Dataset", "x mean", "x sd", "y mean", "y sd"))
Dataset   x mean   x sd   y mean   y sd
1              9    3.3      7.5      2
2              9    3.3      7.5      2
3              9    3.3      7.5      2
4              9    3.3      7.5      2

0.3.2.3 Visualizing Anscombe's Quartet

tidy_anscombe |>
    ggplot(aes(x = x, y = y, colour = set)) +
    geom_point() +
    geom_smooth(method = lm, se = FALSE) +
    facet_wrap(vars(set), nrow = 2, ncol = 2) +
    labs(colour = "Dataset")
`geom_smooth()` using formula = 'y ~ x'

👉 Insight: Again, summary statistics don't tell the full story!
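
One way to check the claim that the four sets share essentially the same regression line is to compute the slope and intercept of a simple regression for each set. This is only a sketch; it uses the closed-form formulas slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x) rather than lm(), so everything fits in one summarise() call:

# Simple-regression slope and intercept for each of the four sets
tidy_anscombe |>
    summarise(
        slope = cov(x, y) / var(x),
        intercept = mean(y) - slope * mean(x),
        .by = set
    )
# Each set returns a slope of roughly 0.5 and an intercept of roughly 3.0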

0.4 Bar Charts: Comparing Categorical Variables

We now explore bar charts using the British Election Panel Study.

0.4.1 Loading and Cleaning the Data

beps <-
    BEPS |>
    as_tibble() |>
    clean_names() |>
    select(age, vote, gender, political_knowledge)

0.4.2 Creating Age Groups

beps <- beps |>
    mutate(
        age_group = case_when(
            age < 35 ~ "<35",
            age < 50 ~ "35-49",
            age < 65 ~ "50-64",
            age < 80 ~ "65-79",
            age < 100 ~ "80-99"
        ),
        age_group = factor(age_group, levels = c("<35", "35-49", "50-64", "65-79", "80-99"))
    )
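
Because case_when() evaluates its conditions in order, a 40-year-old respondent is captured by the age < 50 branch even though the later conditions are also true. As an optional check (not part of the required code), the resulting bins can be tabulated before plotting:

# Count the number of respondents in each age group
beps |>
    count(age_group)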

0.4.3 Plotting the Distribution of Age Groups

beps |> ggplot(aes(x = age_group)) +
    geom_bar() +
    labs(x = "Age group", y = "Number of respondents")

0.5 Scatterplots: Exploring Relationships Between Variables

Using World Bank Data, we analyze GDP growth and inflation.

0.5.1 Downloading the Data

world_bank_data <- WDI(
    indicator = c("FP.CPI.TOTL.ZG", "NY.GDP.MKTP.KD.ZG"),
    country = c("AU", "ET", "IN", "US")
) |>
    rename(inflation = FP.CPI.TOTL.ZG, gdp_growth = NY.GDP.MKTP.KD.ZG)

0.5.2 Plotting GDP Growth vs. Inflation

world_bank_data |>
    ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
    geom_point() +
    labs(x = "GDP Growth", y = "Inflation", color = "Country")
Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_point()`).
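
With four countries overplotted, colour alone can be hard to read. One possible variant (a sketch, not part of the required code) uses facet_wrap() to give each country its own panel:

# One panel per country instead of distinguishing countries by colour
world_bank_data |>
    ggplot(aes(x = gdp_growth, y = inflation)) +
    geom_point() +
    facet_wrap(vars(country)) +
    labs(x = "GDP Growth", y = "Inflation")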

0.6 Line Plots: Time-Series Data

Let's analyze US GDP growth over time.

world_bank_data |>
    filter(country == "United States") |>
    ggplot(aes(x = year, y = gdp_growth)) +
    geom_line() +
    labs(x = "Year", y = "GDP Growth")
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
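
The same approach extends to all four downloaded countries at once. As a sketch of one possible comparison plot, mapping colour to country draws one line per country:

# GDP growth over time for all four countries in world_bank_data
world_bank_data |>
    ggplot(aes(x = year, y = gdp_growth, colour = country)) +
    geom_line() +
    labs(x = "Year", y = "GDP Growth", colour = "Country")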

1 Remember…

  • Always visualize your data!
  • Summary statistics alone can be misleading.
  • Use scatterplots, bar charts, and line plots to explore relationships.

On Thursday, we'll continue with Graphs, Tables, and Maps.