The Institute for Evaluation of Labour Market and Education Policy (IFAU)
2026-05-28
The greatest value of a picture is when it forces us to notice what we never expected to see.
John Tukey, The Future of Data Analysis (1962)
Quote popularised via Tufte (1983)
Snow (1855)
| Variable | N | Mean | SD |
|---|---|---|---|
| x1 | 11 | 9.00 | 3.32 |
| x2 | 11 | 9.00 | 3.32 |
| x3 | 11 | 9.00 | 3.32 |
| x4 | 11 | 9.00 | 3.32 |
| y1 | 11 | 7.50 | 2.03 |
| y2 | 11 | 7.50 | 2.03 |
| y3 | 11 | 7.50 | 2.03 |
| y4 | 11 | 7.50 | 2.03 |
Anscombe (1973)
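The summary table can be reproduced from the `anscombe` data frame that ships with base R. This is a minimal sketch, assuming the table reports the sample mean and sample standard deviation of each column; the name `quartet_summary` is only illustrative.

```r
# Anscombe's quartet is built into R as `anscombe` (columns x1..x4, y1..y4)
quartet_summary <- data.frame(
  variable  = names(anscombe),
  N         = sapply(anscombe, length),
  mean      = round(sapply(anscombe, mean), 2),
  SD        = round(sapply(anscombe, sd), 2),
  row.names = NULL
)
print(quartet_summary)

# The four x-y pairs also have nearly the same correlation
pair_cor <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
print(round(pair_cor, 3))
```

Identical summaries say nothing about the shape of the data, which is why the quartet is plotted rather than tabulated.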
Exploration
Communication
Same grammar, different goals. Most of this lecture is about communication.

Matejka and Fitzmaurice (2017)
Workflow adapted from Wickham et al. (2023)
Verification tasks:
Look at the data (View() or just print it)
Verifying the variable wage, we might ask:
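The exact questions depend on the data at hand. As a rough sketch, a few typical checks could look like the following. The table `wage_dt` and its values are hypothetical and made up for illustration; they are not from the lecture data.

```r
library(data.table)

# Hypothetical toy data, values invented purely for illustration
wage_dt <- data.table(
  id   = 1:6,
  wage = c(31000, 28500, NA, -100, 45000, 2900000)
)

# How many values are missing?
n_missing <- wage_dt[, sum(is.na(wage))]

# Are there impossible values, such as negative wages?
n_negative <- wage_dt[, sum(wage < 0, na.rm = TRUE)]

# Do the extremes look plausible? Min, median, and max can reveal unit errors
wage_quantiles <- wage_dt[, quantile(wage, probs = c(0, 0.5, 1), na.rm = TRUE)]

print(list(missing = n_missing, negative = n_negative, quantiles = wage_quantiles))
```

A maximum that is a hundred times the median, as in this toy example, would be worth tracing back to the source before any plotting or modelling.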
I want to study health. Health records are a (reasonably) accurate measure of health care, but a biased measure of health.
[Graphical excellence] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space
Edward Tufte
Tufte (1983)
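library(data.table)  # fread() and the dt[i] filtering syntax
library(here)        # file paths relative to the project root
library(ggplot2)

# Load the data and drop rows with a missing date
fohm_dt <- fread(here("lectures", "lecture_10", "fohm_c19_death_data.csv"))[!is.na(date)]

# Daily deaths as reported on a single publication date
ggplot(data = fohm_dt[publication_date == "2020-04-13"],
       aes(x = date, y = N)) +
  geom_col() + scale_x_date() +
  geom_vline(xintercept = as.Date("2020-04-08")) + geom_hline(yintercept = 70) +
  coord_cartesian(xlim = as.Date(c("2020-03-01", "2020-04-30")), ylim = c(0, 120)) +
  labs(title = "Publication date: 2020-04-13", x = NULL, y = "Deaths")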
library(gganimate)  # transition_time() and ease_aes()

# Same plot, animated with one frame per publication date
ggplot(data = fohm_dt[publication_date %between% list("2020-04-13", "2020-05-15")],
       aes(x = date, y = N, group = date)) +
  geom_col() + scale_x_date() +
  geom_vline(xintercept = as.Date("2020-04-08")) + geom_hline(yintercept = 70) +
  coord_cartesian(xlim = as.Date(c("2020-03-01", "2020-04-30")), ylim = c(0, 120)) +
  labs(title = "Publication date: {frame_time}", x = NULL, y = "Deaths") +
  transition_time(publication_date) + ease_aes('linear')

290 municipalities. You can tell rates fell, but not much else.
# Municipalities to highlight against the grey background
highlighted <- c("Danderyd", "Stockholm", "Malmö", "Pajala")

ggplot(panel_dt, aes(x = year, y = unemployment_rate, group = municipality_code)) +
  # Background: every municipality as a thin grey line
  geom_line(color = "grey80", linewidth = 0.4) +
  # Foreground: only the highlighted municipalities, in color
  geom_line(data = panel_dt[municipality_name %in% highlighted],
            aes(color = municipality_name), linewidth = 1.2) +
  scale_color_brewer(palette = "Dark2", name = NULL) +
  labs(x = NULL, y = "Unemployment rate (%)")

Background keeps the distribution. Foreground tells a story.
Healy (2018)
Mach bands, adapted from Healy (2018)
ggplot intro
Plot types
Grouping and summarizing
Styling
Wrapping up
ggplot2 package

| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Iraq | Asia | 1952 | 45.32000 | 5441766 | 4129.7661 |
| West Bank and Gaza | Asia | 1952 | 43.16000 | 1030585 | 1515.5923 |
| Mexico | Americas | 1992 | 71.45500 | 88111030 | 9472.3843 |
| China | Asia | 1977 | 63.96736 | 943455000 | 741.2375 |
| Turkey | Europe | 1992 | 66.14600 | 58179144 | 5678.3483 |
| Vietnam | Asia | 1962 | 45.36300 | 33796140 | 772.0492 |
| Nigeria | Africa | 1967 | 41.04000 | 47287752 | 1014.5141 |
| Botswana | Africa | 1987 | 63.62200 | 1151184 | 6205.8839 |
| Paraguay | Americas | 2007 | 71.75200 | 6667147 | 4172.8385 |
| Cuba | Americas | 1972 | 70.72300 | 8831348 | 5305.4453 |
We set the aesthetic mapping of the ggplot object to columns of the gapminder data frame.
To draw something on the canvas we need to add a geometry layer. For example a scatter with geom_point().
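The two steps can be sketched as follows. This assumes the data is the `gapminder` data frame from the CRAN gapminder package (its columns match the table above); the choice of `gdpPercap` and `lifeExp` for the axes is an illustrative assumption, not necessarily the mapping used on the slide.

```r
library(ggplot2)
library(gapminder)  # assumption: the deck's data matches this package

# Step 1: data plus aesthetic mapping gives an empty canvas with axes
p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))

# Step 2: adding a geometry layer draws one point per row
p_scatter <- p + geom_point()
p_scatter
```

Printing `p` on its own shows the axes and grid but no data, which makes the role of the geometry layer easy to see.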

geom_point() is a shortcut for layer(...). Setting mapping and data to NULL means they are inherited from p_box.
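To make the shortcut concrete, here is a sketch that builds the same kind of layer both ways. The definition of `p_box` is an assumption (continent on the x-axis, life expectancy on the y-axis), and the data is assumed to come from the CRAN gapminder package.

```r
library(ggplot2)
library(gapminder)  # assumption: the deck's data matches this package

# Assumed definition of p_box: data and mapping, no layers yet
p_box <- ggplot(gapminder, aes(x = continent, y = lifeExp))

# The shortcut
p_short <- p_box + geom_point()

# A roughly equivalent explicit call
p_long <- p_box + layer(
  geom     = "point",
  stat     = "identity",
  position = "identity",
  mapping  = NULL,  # NULL: inherit the aes() mapping from p_box
  data     = NULL   # NULL: inherit the data from p_box
)
```

Both objects end up with a single point layer; the `geom_*()` functions mainly save typing and set sensible defaults.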
Let’s add a boxplot instead to study the distribution of continuous variables across multiple groups.
To get a more visual sense of where the data is located we can re-add the actual data points.

Highlighting outliers and making the points less prominent. Alpha means transparency.
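The progression from a plain boxplot to one with faded raw points can be sketched like this. The definition of `p_box` and the use of the CRAN gapminder package are assumptions; the jitter width and alpha value mirror the ones used later in the deck.

```r
library(ggplot2)
library(gapminder)  # assumption: the deck's data matches this package

# Assumed definition of p_box
p_box <- ggplot(gapminder, aes(x = continent, y = lifeExp))

# Boxplot with the raw observations jittered horizontally on top
p_points <- p_box +
  geom_boxplot() +
  geom_jitter(width = 0.1, height = 0)

# Same plot, with boxplot outliers in red and the raw points faded
# (alpha runs from 0, invisible, to 1, fully opaque)
p_faded <- p_box +
  geom_boxplot(outlier.color = "red") +
  geom_jitter(width = 0.1, height = 0, alpha = 0.25)
```

Note that in both versions outlying observations are drawn twice, once by the boxplot and once by the jitter layer.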

Starting the y-axis at zero is usually a good idea. Here, I also added a simple theme and formatted the axis labels.
Tag points outside Q1 − 1.5·IQR or Q3 + 1.5·IQR per continent, then split inliers and outliers when we plot.
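The tagging step might look like the sketch below. It assumes `plotdata` is a data.table copy of the gapminder data with all years included; the deck's actual `plotdata` may be filtered differently.

```r
library(data.table)
library(gapminder)  # assumption: the deck's data matches this package

# Assumed starting point: all gapminder rows as a data.table
plotdata <- as.data.table(gapminder)

# Flag observations beyond the whisker limits, separately for each continent
plotdata[, outlier := {
  q1  <- quantile(lifeExp, 0.25, names = FALSE)
  q3  <- quantile(lifeExp, 0.75, names = FALSE)
  iqr <- q3 - q1
  lifeExp < q1 - 1.5 * iqr | lifeExp > q3 + 1.5 * iqr
}, by = continent]

# Number of flagged observations per continent
plotdata[, .(n_outliers = sum(outlier)), by = continent]
```

With the flag in place, `plotdata[outlier == FALSE]` and `plotdata[outlier == TRUE]` can feed different layers, so each observation is drawn only once.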
library(ggrepel)  # geom_text_repel() for non-overlapping labels

p_box +
  geom_boxplot(outlier.color = "red") +
  # Raw points for inliers only, so outliers are not drawn twice
  geom_jitter(data = plotdata[outlier == FALSE],
              position = position_jitter(width = 0.1, height = 0),
              alpha = 0.25) +
  scale_y_continuous(n.breaks = 5,
                     limits = c(0, 100),
                     expand = expansion(c(0, 0.05))) +
  labs(y = "Life expectancy",
       x = "Continent") +
  theme_bw() +
  # Label each outlying country once
  geom_text_repel(data = unique(plotdata[outlier == TRUE], by = "country"),
                  aes(label = country))
library(gganimate)

# Animated bubble chart: one frame per year, one panel per continent
ggplot(copy(gapminder_dt)[continent != "Oceania", ],
       aes(x = gdpPercap, y = lifeExp, size = pop, color = country)) +
  geom_point(alpha = 0.6, show.legend = FALSE) +
  scale_size(range = c(2, 13)) +
  scale_x_log10(limits = c(150, 115000),
                labels = scales::comma) +
  facet_wrap(vars(continent)) +
  coord_fixed(ratio = 1 / 43) +
  labs(title = 'Year: {frame_time}',
       x = 'GDP per capita', y = 'Life expectancy') +
  theme_minimal() +
  transition_time(year) +
  ease_aes('linear')
- geom_point(): relationships
- geom_line(): time series
- geom_col(): comparisons
- geom_histogram(), geom_density(): distributions
- geom_boxplot(): grouped distributions
- geom_smooth(), geom_errorbar(): presenting results
Let's start by working with a really simple dataset with two groups:
geom_point()
geom_col()
This does not look right…
geom_col()
The default is position = "stack", which puts bars with the same x value on top of each other. Setting it to "dodge" places them side by side.
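The difference between stacking and dodging can be seen with a small two-group dataset. The data frame `simple_df` below is hypothetical, with values made up for illustration; it is not the dataset used on the slide.

```r
library(ggplot2)

# Hypothetical two-group dataset, values invented for illustration
simple_df <- data.frame(
  x     = c(1, 2, 3, 1, 2, 3),
  y     = c(2, 4, 3, 1, 3, 5),
  group = rep(c("a", "b"), each = 3)
)

# Default: bars that share an x value are stacked
p_stack <- ggplot(simple_df, aes(x = x, y = y, fill = group)) +
  geom_col()

# Dodged: bars that share an x value are placed side by side
p_dodge <- ggplot(simple_df, aes(x = x, y = y, fill = group)) +
  geom_col(position = "dodge")
```

Stacking suits parts of a whole; dodging is usually easier to read when the goal is comparing the groups directly.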
Collective geoms are plots with connected observations. group tells ggplot how to connect the data.
geom_histogram()
geom_smooth()
By default geom_smooth() fits a loess curve when there are fewer than 1,000 observations (and a GAM otherwise). Pass method = "lm" for a straight line.
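A short sketch of both variants, using the built-in `mtcars` data as a stand-in (the choice of dataset and variables is an assumption for illustration only):

```r
library(ggplot2)

p_base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Default smoother: loess here, since mtcars has only 32 rows
p_loess <- p_base + geom_smooth()

# Straight line from a linear regression, with its confidence band
p_lm <- p_base + geom_smooth(method = "lm", formula = y ~ x)
```

Setting `se = FALSE` drops the confidence band if only the fitted line is wanted.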
geom_errorbar()
Useful for reporting results, e.g., in coefficient plots, but requires the ymin and ymax aesthetics.
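As an illustration, a minimal coefficient plot can be built from a regression on the built-in `mtcars` data. The model and the names `fit`, `coef_df`, and `p_coef` are assumptions chosen for the example.

```r
library(ggplot2)

# Small regression on a built-in dataset, for illustration only
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Collect estimates and 95% confidence intervals in one data frame
ci <- confint(fit)
coef_df <- data.frame(
  term      = names(coef(fit)),
  estimate  = unname(coef(fit)),
  ymin      = ci[, 1],
  ymax      = ci[, 2],
  row.names = NULL
)
coef_df <- coef_df[coef_df$term != "(Intercept)", ]

# Point estimate with an error bar per coefficient, zero as reference line
p_coef <- ggplot(coef_df, aes(x = term, y = estimate, ymin = ymin, ymax = ymax)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_errorbar(width = 0.2) +
  geom_point()
```

The interval bounds have to be computed before plotting; `geom_errorbar()` only draws what `ymin` and `ymax` supply.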
What if we wanted to plot how GDP per capita has developed for each country over time? A line plot should do this well.
Any idea what’s wrong?
group aesthetic
We need to tell ggplot to group the data by country.
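The problem and its fix can be sketched as follows, assuming the data is the `gapminder` data frame from the CRAN gapminder package:

```r
library(ggplot2)
library(gapminder)  # assumption: the deck's data matches this package

# Without group, all rows are joined into a single sawtooth line
p_wrong <- ggplot(gapminder, aes(x = year, y = gdpPercap)) +
  geom_line()

# With group = country, each country gets its own line
p_grouped <- ggplot(gapminder, aes(x = year, y = gdpPercap, group = country)) +
  geom_line()
```

Mapping a discrete variable to `color` or `linetype` also creates groups implicitly, but `group` does so without changing how the lines look.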
Still quite hard to see what’s going on!
Let’s color lines by continent.

Still looks cluttered.
Instead we could split the plot into subplots by continent.

By default facet_wrap() keeps the x and y scales identical across panels. Pass scales = "free_y", "free_x", or "free" to let them vary.
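A sketch of the faceted version with a free y-axis, again assuming the CRAN gapminder package as the data source:

```r
library(ggplot2)
library(gapminder)  # assumption: the deck's data matches this package

p_facets <- ggplot(gapminder,
                   aes(x = year, y = gdpPercap, group = country, color = continent)) +
  geom_line(show.legend = FALSE) +  # panel titles already name the continent
  # One panel per continent; each panel gets its own y-axis range
  facet_wrap(vars(continent), scales = "free_y")
```

Free scales make within-panel patterns easier to see, at the cost of making levels harder to compare across panels.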
How can we increase readability?

A figure that needs explaining is half-finished. Titles, captions, and a line of narrative do the rest.
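Titles and captions are added with `labs()`. A minimal sketch, assuming the CRAN gapminder package as the data source; the label texts are placeholders rather than the ones used in the lecture.

```r
library(ggplot2)
library(gapminder)  # assumption: the deck's data matches this package

p_labeled <- ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  labs(
    title   = "Life expectancy by continent",  # placeholder title
    caption = "Data: gapminder R package",
    x       = NULL,                            # continent names speak for themselves
    y       = "Life expectancy (years)"
  )
```

A title that states the main takeaway, rather than just naming the variables, usually does more of the explaining.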
theme() sets look and feel
About 8% of men and 0.5% of women have some form of color blindness. Let's create a function to evaluate how different palettes look for people who are color blind.
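library(data.table)  # as.data.table() and melt()
library(ggplot2)
library(dichromat)   # dichromat() simulates color vision deficiencies
library(paletteer)

# Show a palette as seen with normal vision and three types of color blindness
colorblind_palette <- function(palette) {
  # One column per vision type, one row per palette color
  sim_dt <- as.data.table(
    append(
      list(x = seq_along(palette),
           "Trichromacy (Original)" = palette),
      lapply(c("Protanopia" = "protan", "Deuteranopia" = "deutan", "Tritanopia" = "tritan"),
             dichromat, colours = palette)
    )
  )
  # Reshape to long format and fill each tile with its own hex color
  melt(sim_dt, id.vars = "x") |>
    ggplot(aes(x = x, y = 0, fill = value)) +
    geom_raster() + facet_wrap(vars(variable), nrow = 4) +
    scale_fill_identity() + theme_void() + theme(legend.position = "none")
}

Let's go with this one for our plot!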
ggsave()
- ggsave(<filename>, p) saves the p plot object to a file
- Set width, height, and scale to get the size right
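A small sketch of saving a plot with explicit dimensions. The plot, the temporary file, and the chosen size are assumptions for illustration; in practice the file name would point into the project folder.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# The file extension selects the output format; tempfile() avoids cluttering the project
out_file <- tempfile(fileext = ".pdf")

# Explicit width and height make the output independent of the current window size
ggsave(out_file, p, width = 16, height = 9, units = "cm")

file.exists(out_file)
```

Fixing the size at save time also fixes the relative size of text, so it is worth choosing dimensions close to how the figure will appear in the final document.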
Tools that will move fast before next year:
Habits that move slowly:
