Lecture 7: Data Wrangling II

Adam Altmejd

adam.altmejd@su.se

The Institute for Evaluation of Labour Market and Education Policy (IFAU)

2026-05-05

Today

Joins / merges
Reshaping wide <-> long
Manipulating strings and date time objects
Applying functions to vectors

Lecture 6 gave us the basic grammar

DT[i, j] to subset and compute
add by= to summarize and transform by group
Today we focus on structure: join and reshape
We will also talk about how to manipulate the data content: strings, regex, and dates

Joining

Joining
Reshaping
Manipulating Strings
Manipulating Dates and Timestamps
Iteration
Extra: More on strings and dates
Extra: DT non-equi joins and rolling joins

Joining: matching rows across tables

Two tables share one or more key columns
A join uses those keys to line up matching rows
The usual goal is to add columns from one table to another

SQL join types

Inner Join: Keep only rows with matching keys in both tables

Left (outer) join: Keep left table, add matches from the right

Full outer join: Keep all rows from both tables

Anti join: Keep rows from left table that have no match in the right

Primary key vs foreign key

A (primary) key: uniquely identifies rows
A foreign key: another table’s primary key
breed_info$breed should be unique
dogs$breed is allowed to repeat

Most join bugs are key bugs

A join matches rows using key columns
If the key is wrong, the join is wrong
If a key is duplicated unexpectedly, rows can multiply
If keys do not match, rows can silently pick up NA

Joining in data.table

data.table has two alternatives:

merge(x, y, by = "key")
Y[X, on = "key"] more efficient, allows calculations during join
- breed_info[dogs, on = .(breed)] is like the SQL command:
- dogs LEFT JOIN breed_info ON breed

Left join: unmatched foreign keys get missing values

breed_info[dogs, on = .(breed)] |>
  _[, .(dog_id, name, breed, avg_life_exp, origin)]

     dog_id     name              breed avg_life_exp         origin
     <char>   <char>             <char>        <num>         <char>
  1:   0013    Buddy Labrador Retriever         11.0         Canada
  2:   3382     Lucy    German Shepherd         11.0        Germany
  3:   4200      Max   Golden Retriever         11.0       Scotland
 ---                                                               
168:   8857    Snowy              Corgi           NA           <NA>
169:   0126 Leo King            Whippet         13.5 United Kingdom
170:   8159      Tut            Whippet         13.5 United Kingdom

Inner join: unmatched rows are dropped

breed_info[dogs, on = .(breed), nomatch = NULL] |>
  _[, .(dog_id, name, breed, avg_life_exp, origin)]

     dog_id     name              breed avg_life_exp         origin
     <char>   <char>             <char>        <num>         <char>
  1:   0013    Buddy Labrador Retriever         11.0         Canada
  2:   3382     Lucy    German Shepherd         11.0        Germany
  3:   4200      Max   Golden Retriever         11.0       Scotland
 ---                                                               
145:   8430      Mop            Whippet         13.5 United Kingdom
146:   0126 Leo King            Whippet         13.5 United Kingdom
147:   8159      Tut            Whippet         13.5 United Kingdom

Anti join: list failed matches

dogs[!breed_info, on = .(breed), .(dog_id, breed)]

    dog_id               breed
    <char>              <char>
 1:   1372 Jack Russel Terrier
 2:   2932 Jack Russel Terrier
 3:   6388 Jack Russel Terrier
---                           
21:   5771               Corgi
22:   1742               Corgi
23:   8857               Corgi

Check the key before trusting the join

Keys should not have duplicates:

anyDuplicated(breed_info, by = "breed")

[1] 0

anyDuplicated(events, by = c("dog_id", "event_time"))

[1] 0

Or missing values:

dogs[, .(missing_breed = sum(is.na(breed)))]

   missing_breed
           <int>
1:             0

events[, .(
  NA_dog_id = sum(is.na(dog_id)),
  NA_event_time = sum(is.na(event_time))
)]

   NA_dog_id NA_event_time
       <int>         <int>
1:         0             0

Duplicate keys can explode the row count

bad_lookup <- rbind(
  breed_info[breed == "Labrador Retriever"],
  breed_info[breed == "Labrador Retriever"]
)

nrow(dogs[breed == "Labrador Retriever"])

[1] 6

nrow(bad_lookup[
  dogs[breed == "Labrador Retriever"],
  on = .(breed),
  allow.cartesian = TRUE
])

[1] 12

Joining tables with different units of observation

nrow(dogs)

[1] 170

nrow(sales[dogs, on = .(breed), allow.cartesian = TRUE])

[1] 7155

dogs has one row per dog
sales has many rows per breed over time
Problem is that the tables do not share the same unit of observation

Build a lookup table at the right unit

sales_by_breed <- sales[,
  .(
    sales_n = .N,
    mean_sale_price = round(mean(price_usd), 1),
    last_sale = max(date)
  ),
  by = breed
]

sales_by_breed

                 breed sales_n mean_sale_price  last_sale
                <char>   <int>           <num>     <IDat>
 1:  Poodle (Standard)     143           658.3 2024-12-12
 2: Labrador Retriever      94           551.6 2024-11-28
 3:                Pug     115           653.2 2024-12-23
---                                                      
 8:          Dachshund      89           360.8 2024-12-22
 9:             Beagle      91           362.8 2024-12-14
10:   Golden Retriever      87           561.1 2024-12-20

Validate the lookup before you join

anyDuplicated(sales_by_breed, by = "breed")

[1] 0

nrow(sales_by_breed)

[1] 10

Join again and inspect what is still unmatched

dogs_sales <- sales_by_breed[dogs, on = .(breed)]

nrow(dogs_sales)

[1] 170

dogs_sales[is.na(mean_sale_price), unique(breed)]

 [1] "Bulldog"                  "Boxer"                    "Bull Terrier"            
 [4] "Dalmatian"                "Whippet"                  "Irish Setter"            
 [7] "Jack Russel Terrier"      "Boston Terrier"           "Corgi"                   
[10] "Irish Wolfhound"          "Chesapeake Bay Retriever" "Afghan Hound"

Composite keys are common in event logs

event_timing <- events[, .(
  dog_id,
  event_time,
  event_type,
  event_date
)]

event_staff <- events[, .(
  dog_id,
  event_time,
  shelter_name,
  staff_id
)]

event_staff[event_timing, on = .(dog_id, event_time)][1:6]

   dog_id          event_time           shelter_name staff_id event_type event_date
   <char>              <POSc>                 <char>   <char>     <char>     <IDat>
1:   0013 2023-01-15 08:42:00     Happy Paws Shelter      A12     intake 2023-01-15
2:   0013 2023-03-20 23:40:00     Happy Paws Shelter      B07    adopted 2023-03-20
3:   3382 2023-02-10 09:05:00       City Animal Care      A12     intake 2023-02-10
4:   3382 2023-02-13 14:30:00       City Animal Care      V01  vet_check 2023-02-13
5:   3382 2023-04-01 11:30:00       City Animal Care      C03    adopted 2023-04-01
6:   4200 2023-03-05 10:15:00 Willow Creek Adoptions      A08     intake 2023-03-05

`i.` lets `data.table` compute during the join

dogs_join <- copy(dogs)

dogs_join[breed_info, on = .(breed), avg_years_left := i.avg_life_exp - age]

dogs_join[1:5, .(dog_id, breed, age, avg_years_left)]

   dog_id              breed   age avg_years_left
   <char>             <char> <int>          <num>
1:   0013 Labrador Retriever     5            6.0
2:   3382    German Shepherd     3            8.0
3:   4200   Golden Retriever     7            4.0
4:   6152            Bulldog     4            5.0
5:   8186             Beagle     6            6.5

Join diagnostics should be routine

Compare row counts before and after the join
Check duplicates in the lookup key
Inspect unmatched rows explicitly
Re-state the intended unit of observation

Detect missing combinations: skeleton + anti-join

events[dog_id %in% c("0013", "3382"),
  .(dog_id, event_time, event_type)]

   dog_id          event_time event_type
   <char>              <POSc>     <char>
1:   0013 2023-01-15 08:42:00     intake
2:   0013 2023-03-20 23:40:00    adopted
3:   3382 2023-02-10 09:05:00     intake
4:   3382 2023-02-13 14:30:00  vet_check
5:   3382 2023-04-01 11:30:00    adopted

skeleton <- CJ(
  dog_id = unique(events$dog_id),
  event_type = c("intake", "vet_check", "adopted")
)
skeleton[!events, on = .(dog_id, event_type)][dog_id %in% c("0013", "3382")]

Key: <dog_id, event_type>
   dog_id event_type
   <char>     <char>
1:   0013  vet_check

CJ() cross-joins the levels you expected to see
Anti-join data against the skeleton: what’s left is missing

Alternative: A plot can reveal gaps

combination_grid <- events[skeleton, on = .(dog_id, event_type)]
combination_grid[, present := !is.na(event_time)]
combination_grid[, event_type := factor(event_type, levels = c("intake", "vet_check", "adopted"))]
missing_combo_plot <- ggplot(
  combination_grid,
  aes(x = event_type, y = dog_id, fill = present)
) +
  geom_tile(color = "white", linewidth = 1.2) +
  scale_fill_manual(values = c("TRUE" = "#d9ead3", "FALSE" = "#f4b6a6")) +
  scale_x_discrete(expand = c(0, 0)) +
  labs(x = NULL, y = NULL) +
  theme(legend.position = "none",
        axis.text.y = element_text(size = 6))

A quick plot makes the gap obvious (cont.)

Reshaping

Joining
Reshaping
Manipulating Strings
Manipulating Dates and Timestamps
Iteration
Extra: More on strings and dates
Extra: DT non-equi joins and rolling joins

One data set, two shapes

Pick shape by use case

Wide is convenient for:
- reading side-by-side and comparing values directly
- joining the result onto a per-unit table
Long is convenient for:
- grouping or summarising by the “variable” itself
- plotting with ggplot (one column per aesthetic: x, y, colour)
- adding a new variable without changing the schema

Reshaping in data.table

Base R has reshape() but it is clunky. data.table has two more intuitive (and much faster) functions:

dcast() for long → wide
melt() for wide → long

`dcast()`: long → wide

events_long <- events[, .N, by = .(event_date, event_type)]
events_long

    event_date event_type     N
        <IDat>     <char> <int>
 1: 2023-01-15     intake     1
 2: 2023-03-20    adopted     1
 3: 2023-02-10     intake     1
---                            
73: 2023-09-14     intake     1
74: 2023-09-16  vet_check     1
75: 2024-01-08  vet_check     1

`dcast()`: long → wide

wide_events <- dcast(
  events_long, # long input
  event_date ~ event_type, # rows ~ new columns
  value.var = "N", # what fills each cell
  fill = 0 # default for empty cells
)
wide_events[1:2]

Key: <event_date>
   event_date adopted intake vet_check
       <IDat>   <int>  <int>     <int>
1: 2023-01-15       0      1         0
2: 2023-02-10       0      1         0

Left of ~ stays as rows
Right of ~ becomes new column names
value.var says which column’s values fill the cells

When cells are not unique, `dcast` aggregates

breed_events <- events[dogs, on = "dog_id", nomatch = NULL]
dcast(breed_events, breed ~ event_type, fun.aggregate = length, value.var = "dog_id")[1:3]

Key: <breed>
           breed adopted intake vet_check
          <char>   <int>  <int>     <int>
1:        Beagle       1      2         2
2: Border Collie       1      1         1
3:         Boxer       3      4         4

Many events share each breed × event_type combination
A cell can only hold one value, so dcast must aggregate
Default is length with a warning; better to say what you mean
- Pass fun.aggregate explicitly: length, sum, mean, min, max

`melt()`: wide → long

long_events <- melt(
  wide_events,
  id.vars = "event_date", # columns to keep as-is
  variable.name = "event_type", # name for old column names
  value.name = "N" # name for old cell values
)
long_events

     event_date event_type     N
         <IDat>     <fctr> <int>
  1: 2023-01-15    adopted     0
  2: 2023-02-10    adopted     0
  3: 2023-02-13    adopted     0
 ---                            
205: 2024-03-30  vet_check     1
206: 2024-04-05  vet_check     0
207: 2024-04-08  vet_check     1

id.vars are the columns that identify a row
All other columns get stacked into one

Why long form pays off

long_events[, .(total = sum(N)), by = event_type]

   event_type total
       <fctr> <int>
1:    adopted    18
2:     intake    30
3:  vet_check    35

Group by event_type directly
- In wide form you’d repeat the calculation for each column
ggplot() wants the same shape: one column per aesthetic
melt() first often simplifies summary and plotting

Example: A missing row can be hard to notice

Manipulating Strings

Joining
Reshaping
Manipulating Strings
Manipulating Dates and Timestamps
Iteration
Extra: More on strings and dates
Extra: DT non-equi joins and rolling joins

String = character vector

Each element is text wrapped in quotes
Most string functions apply to each element

typeof(c("Buddy", "Lucy"))

[1] "character"

length(c("Buddy", "Lucy"))

[1] 2

nchar(c("Buddy", "Lucy"))

[1] 5 4

String problems show up everywhere

Inconsistent case and whitespace: "Corgi", " corgi", "CORGI "
Typos that drift over time: "Russel" vs "Russell"
The same concept under different labels: "Corgi" vs "Pembroke Welsh Corgi"
Several variables jammed into one column: "03/20-2023"
Encoding artifacts on imported data

Encoding is set at import (recap from L6)

fread(file.path(dog_path, "dog_inventory.csv"))[4:6, .(name, `new owner`)]

      name          new owner
    <char>             <char>
1:   Daisy    Ren\xe9e Dubois
2: Charlie       Kenji Tanaka
3: L\xfana Sof\xeda M\xfcller

For dogs we need to set: fread(..., encoding = "Latin-1")

dogs[grepl("é|í|ö|å", name), unique(name)][1:5]

[1] "Zoé"    "Léo"    "Chloé"  "Söphie" "Növa"

Once imported correctly, treat the column as ordinary text
Mangled accents usually mean the encoding was wrong at import

String problems cause join problems

breed_info[dogs, on = .(breed)][
  is.na(avg_life_exp),
  unique(breed)
]

[1] "Jack Russel Terrier" "Corgi"

breed_info[, unique(grepv("corgi|jack", breed, ignore.case = TRUE))]

[1] "Jack Russell Terrier" "Pembroke Welsh Corgi"

Case and whitespace

sample <- c("Buddy ", "  lucy", "REX", "  Corgi  ")

tolower() / toupper() standardise case

tolower(sample)

[1] "buddy "    "  lucy"    "rex"       "  corgi  "

trimws() removes leading and trailing whitespace

trimws(sample)

[1] "Buddy" "lucy"  "REX"   "Corgi"

nchar() returns each string’s length

nchar(sample)

[1] 6 6 3 9

Building and slicing strings

paste(a, b, sep = " ") and paste0(a, b) glue vectors elementwise

dogs[1:3, paste(name, "the", breed)]

[1] "Buddy the Labrador Retriever" "Lucy the German Shepherd"    
[3] "Max the Golden Retriever"

sprintf("%s ... %d", ...) for templated output: - %s strings, %d integers, %f doubles

dogs[1:3, sprintf("%s (%d years old)", name, age)]

[1] "Buddy (5 years old)" "Lucy (3 years old)"  "Max (7 years old)"

substr(x, start, stop) extracts characters by position

dogs[1:3, substr(name, 1, 3)]

[1] "Bud" "Luc" "Max"

Regex: a pattern language

When equality is not enough:

Find every breed starting with “Lab”
Find typos: “Russel” or “Russell”
Split "03/20-2023" on either / or -

A regular expression (regex) describes a pattern of text:

Built into grepl(), sub(), gsub(), tstrsplit(), …
Same syntax across R, Python, JavaScript, grep, sed
A plain string is already a valid regex — "Corgi" matches "Corgi"
The power comes from metacharacters: ^, $, ., *, +, [...], |

Reading a regex

^[A-Z][a-z]+ Russel+$ reads left-to-right as:

^ — start of string
[A-Z] — one uppercase letter
[a-z]+ — one or more lowercase letters
Russe — a space, then literal text
l+ — one or more ls (matches “Russel” and “Russell”)
$ — end of string

Reading a regex

Symbol	Meaning
`[abc]`	one of `a`, `b`, `c`
`[a-z]`	any lowercase letter
`.`	any single character
`*` / `+` / `?`	zero+, one+, or zero-or-one of the previous
`\\s` / `\\d`	whitespace / digit
`^` / `$`	start / end of string
`\|`	OR — either pattern

Detect with `grepl`

# literal: contains "Corgi"
dogs[grepl("Corgi", breed), unique(breed)]

[1] "Corgi"

# alternation: either "Russel" or "Corgi"
dogs[grepl("Russel|Corgi", breed), unique(breed)]

[1] "Jack Russel Terrier" "Corgi"

# case-insensitive
dogs[grepl("corgi", breed, ignore.case = TRUE), unique(breed)]

[1] "Corgi"

grepl(pattern, x) returns TRUE/FALSE per element
grepv() returns the matching values

Anchors

^ matches the start of the string

breed_info[, grepv("^Lab", breed)]

[1] "Labrador Retriever"

$ matches the end of the string

breed_info[, grepv("ver$", breed)]

[1] "Labrador Retriever"       "Golden Retriever"         "Chesapeake Bay Retriever"

Replace with `sub` and `gsub`

dogs[, breed := gsub("Rus+el+", "Russell", breed)]
dogs[grep("[Cc]orgi", breed), breed := "Pembroke Welsh Corgi"]

breed_info[dogs, on = .(breed)][
  is.na(avg_life_exp),
  unique(breed)
]

character(0)

sub(pat, repl, x) — replace first match per element
gsub(pat, repl, x) — replace all matches per element

gsub("\\s+", " ", c("Border  Collie", "Labrador   Retriever"))

[1] "Border Collie"      "Labrador Retriever"

\\s for a whitespace character
+ for one or more of the preceding character

Manipulating Dates and Timestamps

Joining
Reshaping
Manipulating Strings
Manipulating Dates and Timestamps
Iteration
Extra: More on strings and dates
Extra: DT non-equi joins and rolling joins

What is a date object?

A Date stores a calendar day with no time of day
Internally it is a number: days since 1970-01-01
Arithmetic and comparisons work on that number
Printing formats the number back as a calendar string

d <- as.Date("2024-02-28")

class(d)

[1] "Date"

unclass(d)

[1] 19781

d + 3

[1] "2024-03-02"

d - as.Date("2022-02-28")

Time difference of 730 days

What is a timestamp object?

A POSIXct stores an instant in time, accurate to seconds
Internally it is a number: seconds since 1970-01-01 UTC
It carries a timezone attribute that controls how it prints

t <- as.POSIXct("2024-03-20 23:40:00", tz = "Europe/Stockholm")

class(t)

[1] "POSIXct" "POSIXt"

unclass(t)

[1] 1710974400
attr(,"tzone")
[1] "Europe/Stockholm"

format(t, tz = "UTC", usetz = TRUE)

[1] "2024-03-20 22:40:00 UTC"

`Date` vs `data.table::IDate`

IDate is data.table’s calendar-date type — integer-backed, ~half the memory of base R Date
IDate inherits from Date, so the two are interchangeable in almost all operations
Date / IDate: day only — no time of day, no timezone
Don’t record time if you don’t need it — it’s a common source of problems

Strings that look like dates are not dates

dogs[, sale_date - intake_date]

Error in `-.IDate`:
! can only subtract from "IDate" objects

sapply(dogs[, .(intake_date, sale_date)], class)

$intake_date
[1] "IDate" "Date" 

$sale_date
[1] "character"

Parsing strings into dates

Same idea for both: tell R the format, get a real type back
as.IDate(x, format = ...) for calendar dates
as.POSIXct(x, format = ..., tz = ...) for timestamps
fread() parses common formats automatically — flag suspect columns and re-parse
The format string describes the input, not the output

`strptime` format codes

Code	Meaning
`%Y`	year, 4-digit
`%y`	year, 2-digit
`%m`	month number
`%B`	month name
`%b`	month abbreviated
`%d`	day of month

Code	Meaning
`%H`	hour (24h)
`%M`	minute
`%S`	second
`%F`	`%Y-%m-%d` shortcut
`%T`	`%H:%M:%S` shortcut

as.IDate("13 March 2024", format = "%d %B %Y")

[1] "2024-03-13"

Parse explicitly when the format is unusual

str(dogs$sale_date)

 chr [1:170] "03/20-2023" "04/01-2023" "06/15-2023" "07/11-2023" "08/02-2023" ...

dogs[, sale_date := as.IDate(sale_date, format = "%m/%d-%Y")]
dogs[1:5, .(dog_id, intake_date, sale_date)]

   dog_id intake_date  sale_date
   <char>      <IDat>     <IDat>
1:   0013  2023-01-15 2023-03-20
2:   3382  2023-02-10 2023-04-01
3:   4200  2023-03-05 2023-06-15
4:   6152  2023-04-20 2023-07-11
5:   8186  2023-05-11 2023-08-02

Avoid format codes with `lubridate`

Function names spell out component order: ymd, mdy, dmy
ymd_hms("2024-03-20 23:40:00") for timestamps
Optional tz = argument; otherwise UTC

mdy("03/20-2023")

[1] "2023-03-20"

ymd_hms("2024-03-20 23:40:00", tz = "Europe/Stockholm")

[1] "2024-03-20 23:40:00 CET"

`fread()` recognises timestamps automatically

ts <- fread(
  file.path(dog_path, "dog_events.csv"),
  colClasses = list(character = "dog_id")
)[dog_id == "0013" & event_type == "adopted", .(dog_id, event_type, event_time)]
ts

   dog_id event_type          event_time
   <char>     <char>              <POSc>
1:   0013    adopted 2023-03-20 23:40:00

class(ts$event_time)

[1] "POSIXct" "POSIXt"

attr(ts$event_time, "tzone")

[1] "UTC"

Warning! Assumes the timezone is UTC

Assuming the wrong timezone can shift dates

format(ts$event_time, tz = "Europe/Stockholm")

[1] "2023-03-21 00:40:00"

as.POSIXct(ts$event_time, tz = "Europe/Stockholm")

[1] "2023-03-21 00:40:00 CET"

as.IDate(ts$event_time, tz = "Europe/Stockholm")

[1] "2023-03-21"

All outputs change time/dates instead of timezone.

Fix: read as text, then parse with the correct timezone

ts_fix <- fread(
  file.path(dog_path, "dog_events.csv"),
  colClasses = list(character = c("dog_id", "event_time"))
)[dog_id == "0013" & event_type == "adopted", .(dog_id, event_type, event_time)]
ts_fix[,
  event_time := as.POSIXct(
    event_time,
    format = "%Y-%m-%d %H:%M:%S",
    tz = "Europe/Stockholm"
  )
]
ts_fix

   dog_id event_type          event_time
   <char>     <char>              <POSc>
1:   0013    adopted 2023-03-20 23:40:00

Better fix from the `lubridate` package

with_tz(t, "Europe/Stockholm") keeps instant, changes display

with_tz(ts$event_time, "Europe/Stockholm")

[1] "2023-03-21 00:40:00 CET"

force_tz(t, "Europe/Stockholm") keeps time, changes instant

force_tz(ts$event_time, "Europe/Stockholm")

[1] "2023-03-20 23:40:00 CET"

Date arithmetic and comparison

dogs[, stay_days := sale_date - intake_date]

dogs[
  !is.na(stay_days),
  .(
    average_stay_days = mean(stay_days),
    longest_stay = max(stay_days)
  )
]

   average_stay_days longest_stay
          <difftime>   <difftime>
1:     87.19403 days     116 days

dogs[sale_date > as.IDate("2024-01-01"), .N]

[1] 42

Extract date components (with `lubridate`)

dogs[
  !is.na(sale_date),
  .(
    sale_date,
    year = year(sale_date),
    month = month(sale_date),
    day = mday(sale_date),
    weekday = wday(sale_date)
  )
][1:5]

    sale_date  year month   day weekday
       <IDat> <num> <num> <int>   <num>
1: 2023-03-20  2023     3    20       2
2: 2023-04-01  2023     4     1       7
3: 2023-06-15  2023     6    15       5
4: 2023-07-11  2023     7    11       3
5: 2023-08-02  2023     8     2       4

Round dates and month-safe arithmetic (`lubridate`)

floor_date(as.Date("2024-03-20"), unit = "month")

[1] "2024-03-01"

ceiling_date(as.Date("2024-03-20"), unit = "week")

[1] "2024-03-24"

as.Date("2024-01-31") + months(1) # NA — Feb 31 does not exist

[1] NA

as.Date("2024-01-31") %m+% months(1) # 2024-02-29 — clamps to month end

[1] "2024-02-29"

Using `floor_date` in `by=` for monthly aggregation

# Aggregate sales by month
dogs[!is.na(sale_date), .N, by = .(month = floor_date(sale_date, "month"))][1:6]

        month     N
       <Date> <int>
1: 2023-03-01     1
2: 2023-04-01     1
3: 2023-06-01     7
4: 2023-07-01     9
5: 2023-08-01    14
6: 2023-09-01    13

Time differences with `difftime`

t1 <- as.POSIXct("2024-03-20 23:40:00", tz = "Europe/Stockholm")
t2 <- as.POSIXct("2024-03-21 09:15:00", tz = "Europe/Stockholm")

t2 - t1

Time difference of 9.583333 hours

difftime(t2, t1, units = "hours")

Time difference of 9.583333 hours

as.numeric(difftime(t2, t1, units = "hours"))

[1] 9.583333

Using dates in diagnostic plots

A count over time graph can reveal missing data that a tabular check would miss.

events[, .N, by = event_date] |>
  ggplot(aes(x = event_date, y = N)) +
  geom_col() +
  labs(x = NULL, y = "events per day")

Iteration

Joining
Reshaping
Manipulating Strings
Manipulating Dates and Timestamps
Iteration
Extra: More on strings and dates
Extra: DT non-equi joins and rolling joins

Vectorise or use `by` before you iterate with loops

Same summary by group → by=
Same rule on several columns → .SD
Same operation across separate files or objects → iterate

`lapply()` and relatives

lapply(x, f) calls f on each element of x — returns a list, one result per input
Predictable shape: length(input) == length(output)
The wider family differs in how the output is shaped:
- sapply(x, f) — simplify to a vector or matrix when possible
- vapply(x, f, FUN.VALUE) — like sapply with a type contract
- mapply(f, x, y) — parallel iteration over several vectors
- apply(M, MARGIN, f) — over rows or columns of a matrix

Counting the number of characters in each word

words <- c("Buddy", "Lucy", "Rex")

# list of results
lapply(words, nchar)

[[1]]
[1] 5

[[2]]
[1] 4

[[3]]
[1] 3

# simplified to a named vector
sapply(words, nchar)

Buddy  Lucy   Rex 
    5     4     3

Example: reading many files at once

basename(event_paths)

[1] "city_animal_care_events.csv"       "happy_paws_shelter_events.csv"    
[3] "willow_creek_adoptions_events.csv"

One CSV per shelter, dropped in a folder
Read them all with the right types, stacked into one table
Keep track of which row came from which file

Write a reusable reader function

read_dog_events <- function(path) {
  if (!file.exists(path)) {
    warning("File not found, skipping: ", path)
    return(NULL)
  }
  dt <- fread(path, colClasses = list(character = c("dog_id", "event_time")))
  dt[,
    event_time := as.POSIXct(
      event_time,
      format = "%Y-%m-%d %H:%M:%S",
      tz = "Europe/Stockholm"
    )
  ]
  dt[, event_date := as.IDate(event_time, tz = "Europe/Stockholm")]
  dt
}

Pin down the format quirks: character IDs, Swedish local time
warning() flags a problem without aborting the batch
Returning NULL lets rbindlist() skip the file silently

Workhorse pattern: `lapply()` + `rbindlist(idcol = ...)`

event_list <- lapply(event_paths, read_dog_events)
names(event_list) <- basename(event_paths)

all_events <- rbindlist(event_list, idcol = "source_file")
all_events[, .(rows = .N, dogs = uniqueN(dog_id)), by = source_file]

                         source_file  rows  dogs
                              <char> <int> <int>
1:       city_animal_care_events.csv    31    12
2:     happy_paws_shelter_events.csv    27    13
3: willow_creek_adoptions_events.csv    25    11

lapply() returns one list element per file
Naming the list lets rbindlist(idcol = ...) keep provenance
rbindlist silently drops NULL entries
- A missing file just doesn’t contribute rows

Validation checklist

After every join, reshape, or type change:

Row count — did it change as expected?
Key uniqueness — duplicates can explode joins
Unmatched rows — which ones, and why?
Missing combinations — anything absent that should exist?
Types — dates are dates, IDs stayed character

Main takeaways

Join problems are usually key problems
Wide vs long is a question about what each row represents — pick the shape that fits the use case
String and date cleaning are often prerequisites for correct joins
Keep validating: check row counts, unit of observation, unmatched rows after every join, reshape, or type change
Wrap repeated logic in a function, then apply it with lapply()

Next lecture: APIs and external data

Extra: More on strings and dates

Joining
Reshaping
Manipulating Strings
Manipulating Dates and Timestamps
Iteration
Extra: More on strings and dates
Extra: DT non-equi joins and rolling joins

Split with `tstrsplit`

sale_parts <- dogs[!is.na(sale_date), .(sale_date)]
sale_parts[,
  c("sale_month", "sale_day", "sale_year") := tstrsplit(sale_date, "[/-]")
]

sale_parts[1:5]

    sale_date sale_month sale_day sale_year
       <IDat>     <char>   <char>    <char>
1: 2023-03-20       2023       03        20
2: 2023-04-01       2023       04        01
3: 2023-06-15       2023       06        15
4: 2023-07-11       2023       07        11
5: 2023-08-02       2023       08        02

[/-] matches either / or - — handles inconsistent delimiters in one pass
Useful when one column hides several variables
For dates specifically, parse to a real type instead — that is the next section

`stringr` offers a more consistent grammar

All functions start with str_ and take the string first
str_detect(x, pattern) for grepl
str_replace_all(x, pattern, replacement) for gsub
str_to_lower(x), str_squish(x), str_split_fixed(x, sep, n)
Named vectors handle multiple replacements at once

library(stringr)
dogs[, breed := str_replace_all(breed, c(
    "Russel"  = "Russell",
    "^Corgi$" = "Pembroke Welsh Corgi"
))]

ISO 8601 week convention

Plain “week of the year” is ambiguous — does week 1 start Sun or Mon? Does it start in the old or new year?
ISO 8601 fixes a convention:
- weeks run Mon–Sun
- week 1 is the one containing the year’s first Thursday
Standard in business, public-health, and official statistics reporting (Eurostat, ECDC, retail) — needed if your data must line up with those sources

ISO week footgun

ISO weeks have their own year. Dec 30, 2024 is ISO week 1 of 2025; pair isoyear() with isoweek(), never year()

dogs[
  !is.na(sale_date),
  .N,
  by = .(
    iso_year = isoyear(sale_date),
    iso_week = isoweek(sale_date)
  )
][order(iso_year, iso_week)][1:8]

   iso_year iso_week     N
      <num>    <num> <int>
1:     2023       12     1
2:     2023       13     1
3:     2023       18     1
4:     2023       19     2
5:     2023       20     1
6:     2023       21     1
7:     2023       22     1
8:     2023       23     2

Extra: DT non-equi joins and rolling joins

Joining
Reshaping
Manipulating Strings
Manipulating Dates and Timestamps
Iteration
Extra: More on strings and dates
Extra: DT non-equi joins and rolling joins

Non-equi joins: match with inequalities

window <- data.table(start = as.IDate("2023-06-01"), end = as.IDate("2023-06-15"))
sales[window,
  on = .(date >= start, date <= end),
  .(breed, sale_date = x.date, price_usd),
  nomatch = NULL]

                 breed  sale_date price_usd
                <char>     <IDat>     <int>
 1:          Chihuahua 2023-06-08       341
 2:     French Bulldog 2023-06-08       867
 3: Labrador Retriever 2023-06-03       592
---                                        
19:      Border Collie 2023-06-01       486
20:     French Bulldog 2023-06-13      1233
21:             Beagle 2023-06-09       402

on = accepts <, <=, >, >= instead of ==
Match: rows where date falls inside the window

Rolling joins: as-of lookup

sales_lab <- sales[breed == "Labrador Retriever", .(date, price_usd)]
setkey(sales_lab, date)
queries <- data.table(date = as.IDate(c("2023-01-15", "2023-06-15", "2023-12-08")))
sales_lab[queries, on = "date", roll = TRUE]

         date price_usd
       <IDat>     <int>
1: 2023-01-15       618
2: 2023-06-15       592
3: 2023-12-08       587

For each query date, find the most recent earlier observation
roll = TRUE carries the last prior value forward; roll = "nearest" picks closest in either direction

Data table magic

Now combine both ideas. Let’s calculate the average price during two months around each sale:

dogs[
  sales[
    dogs[
      !is.na(sale_date),
      .(
        dog_id,
        breed,
        date1 = as.Date(sale_date) %m-% months(2),
        date2 = as.Date(sale_date) %m+% months(2)
      )
    ],
    on = c("breed", "date >= date1", "date < date2"),
    by = .EACHI,
    .(dog_id, mean_price = mean(price_usd, na.rm = TRUE))
  ],
  on = "dog_id",
  avg_price_2months := i.mean_price
]

Data table magic (cont.)

Let’s verify it worked

dogs[dog_id == 9367, .(breed, sale_date, avg_price_2months)]

           breed  sale_date avg_price_2months
          <char>     <IDat>             <num>
1: Border Collie 2024-01-12          420.0909

sales[
  breed == "Border Collie" &
    date %between%
      list(
        as.Date("2024-01-12") - months(2),
        as.Date("2024-01-12") + months(2)
      ),
  mean(price_usd)
]

[1] 420.0909

Lecture 7: Data Wrangling II

Today

Lecture 6 gave us the basic grammar

Joining

Joining: matching rows across tables

SQL join types

Primary key vs foreign key

Most join bugs are key bugs

Joining in data.table

Left join: unmatched foreign keys get missing values

Inner join: unmatched rows are dropped

Anti join: list failed matches

Check the key before trusting the join

Duplicate keys can explode the row count

Joining tables with different units of observation

Build a lookup table at the right unit

Validate the lookup before you join

Join again and inspect what is still unmatched

Composite keys are common in event logs

i. lets data.table compute during the join

Join diagnostics should be routine

Detect missing combinations: skeleton + anti-join

Alternative: A plot can reveal gaps

A quick plot makes the gap obvious (cont.)

Reshaping

One data set, two shapes

Pick shape by use case

Reshaping in data.table

dcast(): long → wide

dcast(): long → wide

When cells are not unique, dcast aggregates

melt(): wide → long

Why long form pays off

Example: A missing row can be hard to notice

Manipulating Strings

String = character vector

String problems show up everywhere

Encoding is set at import (recap from L6)

String problems cause join problems

Case and whitespace

Building and slicing strings

Regex: a pattern language

Reading a regex

Reading a regex

Detect with grepl

Anchors

Replace with sub and gsub

Manipulating Dates and Timestamps

What is a date object?

What is a timestamp object?

Date vs data.table::IDate

Strings that look like dates are not dates

Parsing strings into dates

strptime format codes

Parse explicitly when the format is unusual

Avoid format codes with lubridate

fread() recognises timestamps automatically

Assuming the wrong timezone can shift dates

Fix: read as text, then parse with the correct timezone

Better fix from the lubridate package

Date arithmetic and comparison

Extract date components (with lubridate)

Round dates and month-safe arithmetic (lubridate)

Using floor_date in by= for monthly aggregation

Time differences with difftime

Using dates in diagnostic plots

Iteration

Vectorise or use by before you iterate with loops

lapply() and relatives

Counting the number of characters in each word

Example: reading many files at once

Write a reusable reader function

Workhorse pattern: lapply() + rbindlist(idcol = ...)

Validation checklist

Main takeaways

Next lecture: APIs and external data

Extra: More on strings and dates

Split with tstrsplit

stringr offers a more consistent grammar

ISO 8601 week convention

`i.` lets `data.table` compute during the join

`dcast()`: long → wide

`dcast()`: long → wide

When cells are not unique, `dcast` aggregates

`melt()`: wide → long

Detect with `grepl`

Replace with `sub` and `gsub`

`Date` vs `data.table::IDate`

`strptime` format codes

Avoid format codes with `lubridate`

`fread()` recognises timestamps automatically

Better fix from the `lubridate` package

Extract date components (with `lubridate`)

Round dates and month-safe arithmetic (`lubridate`)

Using `floor_date` in `by=` for monthly aggregation

Time differences with `difftime`

Vectorise or use `by` before you iterate with loops

`lapply()` and relatives

Workhorse pattern: `lapply()` + `rbindlist(idcol = ...)`

Split with `tstrsplit`

`stringr` offers a more consistent grammar