Lecture 6: Data Wrangling I

Adam Altmejd

adam.altmejd@su.se

The Institute for Evaluation of Labour Market and Education Policy (IFAU)

2026-04-28

Today

Importing real files
Core data.table grammar: DT[i, j, by] and tooling
Missing-values and type repair
Writing a cleaned file back to disk
Wrangling with functions

Importing: Raw data is not always easy to read

A file is just bytes
Import decides:
- delimiter
- types
- encoding
- missing-values
Inspection checks whether those choices were sensible

Wrangling: Make data usable for analysis

Clean inconsistencies
Tidy data
Transform data, create variables

Importing Data

Importing Data
Core data.table Grammar
Useful data.table Tooling
Grouping and Summarizing
Missing Values And Type Repair
Writing Clean Data
Using Functions when Wrangling

Repeating the import checklist

Before doing any real work, ask:

Where is the file, and what produced it?
Are identifiers read with the right type?
Are missing values coded as blanks, NA, or fake numbers?
Do names need repair?
What is the intended key?

`data.table::fread()` for text files

fread() is fast and flexible for delimited text (CSV, TSV)
Automatically solves many possible data issues

Let’s import data from an AI-generated kennel

dog_inventory.csv our inventory
dog_breeds.csv a lookup-table of information
dog_events.csv event history for each dog
dog_sales.csv “market research”

dog_path <- here::here(
  "data-sources", "data", "dogs", "dog_inventory.csv"
)
list.files(
  dirname(dog_path),
  pattern = "^dog_.*\\.csv$"
)

[1] "dog_breeds.csv"    "dog_events.csv"    "dog_inventory.csv" "dog_sales.csv"

View your raw data

readLines(dog_path, n = 6)

[1] "Dog Shelter Intake Data - Q1 2025 Records with Pricing and owner Info"           
[2] "Last updated 2025-03-31"                                                         
[3] "---"                                                                             
[4] ""                                                                                
[5] "dog_id,name,breed,age,color,price,insured,intake_date,sale_date,new owner"       
[6] "0013,Buddy,Labrador Retriever,5,Yellow,450.50,1,2023-01-15,03/20-2023,John Smith"

`fread()` is forgiving, but can’t solve all issues

fread(dog_path)[3:4]

Warning in fread(dog_path): Discarded single-line footer: <<The data is
computer-generated for educational purposes.>>

   dog_id   name            breed   age  color  price insured intake_date  sale_date
    <int> <char>           <char> <int> <char>  <num>   <int>      <IDat>     <char>
1:   4200    Max Golden Retriever     7 Golden 550.75       0  2023-03-05 06/15-2023
2:   6152  Daisy          Bulldog     4  White 700.00       1  2023-04-20 07/11-2023
         new owner
            <char>
1:   James Johnson
2: Ren\xe9e Dubois

Good news: fread() found the real header
Less good: that still does not mean the import is correct

Encoding problems show up in the values

fread(dog_path)[c(4, 6), .(name, `new owner`)]

Warning in fread(dog_path): Discarded single-line footer: <<The data is
computer-generated for educational purposes.>>

      name          new owner
    <char>             <char>
1:   Daisy    Ren\xe9e Dubois
2: L\xfana Sof\xeda M\xfcller

Computers need to encode characters to store them
Unfortunately there are tons of different encodings
A-Z, 0-9 is usually the same, but code for e.g. å can vary
Can create a real mess…

Tell `fread` what you know

fread(
  dog_path,
  skip = 3,
  encoding = "Latin-1"
)[c(4, 6), .(name, `new owner`)]

Warning in fread(dog_path, skip = 3, encoding = "Latin-1"): Discarded single-line footer:
<<The data is computer-generated for educational purposes.>>

     name    new owner
   <char>       <char>
1:  Daisy Renée Dubois
2:   Lúna Sofía Müller

skip = 3 removes the preamble explicitly
encoding = "Latin-1" tells fread the file is in Latin-1 (an old single-byte Western European encoding)

Repairing names

names(fread(dog_path, skip = 3, encoding = "Latin-1"))[9:10]

[1] "sale_date" "new owner"

names(fread(dog_path, skip = 3, encoding = "Latin-1", check.names = TRUE))[9:10]

[1] "sale_date" "new.owner"

check.names = TRUE converts spaces into syntactic names

Protect identifiers at import time

dogs_bad <- fread(dog_path, skip = 3, encoding = "Latin-1")
dogs_good <- fread(
  dog_path,
  skip = 3,
  encoding = "Latin-1",
  colClasses = list(character = "dog_id")
)

dogs_bad[1:5, dog_id]

[1]   13 3382 4200 6152 8186

dogs_good[1:5, dog_id]

[1] "0013" "3382" "4200" "6152" "8186"

A good import call is explicit

dogs <- fread(
  dog_path,
  skip = 3,
  encoding = "Latin-1",
  check.names = TRUE,
  na.strings = c("", "-99"),
  colClasses = list(character = "dog_id")
)

dogs[1:3, .(dog_id, name, intake_date, sale_date, new.owner)]

   dog_id   name intake_date  sale_date     new.owner
   <char> <char>      <IDat>     <char>        <char>
1:   0013  Buddy  2023-01-15 03/20-2023    John Smith
2:   3382   Lucy  2023-02-10 04/01-2023  Maria Garcia
3:   4200    Max  2023-03-05 06/15-2023 James Johnson

`select` the columns you use

fread(
  dog_path,
  skip = 3,
  encoding = "Latin-1",
  select = c(
    "dog_id",
    "name",
    "breed",
    "intake_date",
    "sale_date"
  ),
  colClasses = list(character = "dog_id")
)[1:5]

   dog_id    name              breed intake_date  sale_date
   <char>  <char>             <char>      <IDat>     <char>
1:   0013   Buddy Labrador Retriever  2023-01-15 03/20-2023
2:   3382    Lucy    German Shepherd  2023-02-10 04/01-2023
3:   4200     Max   Golden Retriever  2023-03-05 06/15-2023
4:   6152   Daisy            Bulldog  2023-04-20 07/11-2023
5:   8186 Charlie             Beagle  2023-05-11 08/02-2023

Importing Stata `.dta` files with `haven`

Use haven::read_dta(). Preserves Stata labels.
Convert to data.table with as.data.table(),
Set value labels with as_factors()

library(haven)

stata_dt = read_dta("path/to/stata_file.dta") |>
  as.data.table()

stata_dt[, factor_variable := as_factor(labelled_variable)]

Reading excel files with `readxl`

library(readxl)
excel_dt = read_excel("path/to/spreadsheet.xlsx")

readxl supports selecting sheets:

excel_sheets("path/to/spreadsheet.xlsx")
read_excel("spreadsheet.xlsx", sheet = "Sheet2")

and even specific ranges:

read_excel("spreadsheet.xlsx", range = "A1:D10", skip = 2)

Core data.table Grammar

Importing Data
Core data.table Grammar
Useful data.table Tooling
Grouping and Summarizing
Missing Values And Type Repair
Writing Clean Data
Using Functions when Wrangling

`data.table` and `dplyr`

Two dominant paradigms for data manipulation in R
Both are powerful and widely used, but with different syntax
We will focus on data.table, because it is much faster

`data.table`: Core concept

An enhancement of base R’s data.frame
Fast: Operations implemented in C
Memory efficient: Modifies by reference using :=
Indexing: Automatic indexing speeds up subsets and joins
Concise syntax: Combines operations efficiently

Read `DT[i, j, by]` as a sentence

“from this table, keep these rows (i), then return (and compute) these columns (j), grouped by these variables (by)”

`i` filters rows

dogs[
  age >= 5 & price > 500,
  .(dog_id, name, breed, age, price)
]

    dog_id     name            breed   age   price
    <char>   <char>           <char> <int>   <num>
 1:   4200      Max Golden Retriever     7  550.75
 2:   4094   Cooper        Dachshund     5  650.50
 3:   6988     Bear            Boxer     6  900.00
---                                               
75:   8279   Bonnie     Irish Setter     7  690.50
76:   9367   Fluffy    Border Collie     8  600.00
77:   0126 Leo King          Whippet     5 1010.50

`i` can use membership too

dogs[
  breed %in% c("Pug", "Boxer", "Border Collie") & insured == 1,
  .(dog_id, name, breed, insured)
]

    dog_id   name         breed insured
    <char> <char>        <char>   <int>
 1:   2652   Milo         Boxer       1
 2:   6635  Bella         Boxer       1
 3:   6988   Bear         Boxer       1
---                                    
18:   3599 Brutus           Pug       1
19:   4636  Balto Border Collie       1
20:   2977  Major           Pug       1

`%between%` filters ranges

dogs[
  price %between% c(400, 700) & age %between% c(2, 5),
  .(dog_id, name, age, price)
]

    dog_id     name   age price
    <char>   <char> <int> <num>
 1:   0013    Buddy     5 450.5
 2:   3382     Lucy     3 600.0
 3:   6152    Daisy     4 700.0
---                            
33:   5497 Clifford     5 680.0
34:   2738   Ripley     2 610.5
35:   5771    Slick     4 670.0

Shorthand for x >= a & x <= b, both endpoints included
Reads closer to intent than a chained comparison

`j` selects columns

dogs[
  ,
  .(dog_id, name, breed, age, price, insured)
]

     dog_id     name              breed   age   price insured
     <char>   <char>             <char> <int>   <num>   <int>
  1:   0013    Buddy Labrador Retriever     5  450.50       1
  2:   3382     Lucy    German Shepherd     3  600.00       1
  3:   4200      Max   Golden Retriever     7  550.75       0
 ---                                                         
169:   8857    Snowy              Corgi     2  990.00      NA
170:   0126 Leo King            Whippet     5 1010.50       0
171:   8159      Tut            Whippet     0  750.75       1

`j` can compute new outputs

dogs[
  age > 0,
  .(
    dog_id,
    name,
    breed,
    price_per_year = round(price / age, 1)
  )
][1:8]

   dog_id    name              breed price_per_year
   <char>  <char>             <char>          <num>
1:   0013   Buddy Labrador Retriever           90.1
2:   3382    Lucy    German Shepherd          200.0
3:   4200     Max   Golden Retriever           78.7
4:   6152   Daisy            Bulldog          175.0
5:   8186 Charlie             Beagle           50.0
6:   1893    Lúna            Bulldog          400.0
7:   4094  Cooper          Dachshund          130.1
8:   1099   Sadie          Dachshund           50.0

`:=` modifies by reference

dogs_work <- copy(dogs)

dogs_work[, price_per_year := fifelse(
  age > 0,
  round(price / age, 1),
  NA_real_
)]

dogs_work[
  1:5,
  .(dog_id, name, age, price, price_per_year)
]

   dog_id    name   age  price price_per_year
   <char>  <char> <int>  <num>          <num>
1:   0013   Buddy     5 450.50           90.1
2:   3382    Lucy     3 600.00          200.0
3:   4200     Max     7 550.75           78.7
4:   6152   Daisy     4 700.00          175.0
5:   8186 Charlie     6 300.25           50.0

Use `copy()` when you need separation

"price_per_year" %in% names(dogs)

[1] FALSE

"price_per_year" %in% names(dogs_work)

[1] TRUE

Comparison to `dplyr`

Part of the Tidyverse collection of packages
Focuses on readability and consistency
Uses distinct “verbs” for common operations

data.table

data[condition,
     .(new_var = mean(old_var)),
     by = "group_var"]

dplyr

data |>
  filter(condition) |>
  group_by(group_var) |>
  summarise(new_var = mean(old_var))

`data.table` vs. `dplyr` - key differences

data.table

Pros: Speed, memory efficiency, concise syntax, powerful indexing/joins. Ideal for large data.
Cons: Steeper learning curve, syntax less “English-like”, modify-by-reference needs care.

dplyr

Pros: Highly readable verbs, Tidyverse integration (ggplot2, etc.).
Cons: Slower, uses more memory, especially on large data or complex groups.

Useful data.table Tooling

Importing Data
Core data.table Grammar
Useful data.table Tooling
Grouping and Summarizing
Missing Values And Type Repair
Writing Clean Data
Using Functions when Wrangling

Setting a key creates an index for faster operations

dogs[, .(dog_id, name, breed)][1:2]

   dog_id   name              breed
   <char> <char>             <char>
1:   0013  Buddy Labrador Retriever
2:   3382   Lucy    German Shepherd

setkey(dogs, dog_id, breed)
dogs[, .(dog_id, name, breed)][1:2]

Key: <dog_id, breed>
   dog_id   name              breed
   <char> <char>             <char>
1:   0009  Scout      Border Collie
2:   0013  Buddy Labrador Retriever

Removing columns by reference with `:= NULL`

dogs_work[, price_per_year := NULL]
"price_per_year" %in% names(dogs_work)

[1] FALSE

Renaming and reordering by reference

dogs_demo <- copy(dogs)

setnames(dogs_demo, c("new.owner", "intake_date"), c("owner", "intake"))
setcolorder(dogs_demo, c("dog_id", "name", "breed", "owner"))

names(dogs_demo)

 [1] "dog_id"    "name"      "breed"     "owner"     "age"       "color"     "price"    
 [8] "insured"   "intake"    "sale_date"

:= NULL deletes
setnames() renames
setcolorder() reorders

`fcase()` is cleaner than nested `ifelse()`

dogs_work[, fee_group := fcase(
  price < 400, "low",
  price < 700, "middle",
  default = "high"
)]

dogs_work[, .(dog_id, name, price, fee_group)]

     dog_id     name   price fee_group
     <char>   <char>   <num>    <char>
  1:   0013    Buddy  450.50    middle
  2:   3382     Lucy  600.00    middle
  3:   4200      Max  550.75    middle
 ---                                  
169:   8857    Snowy  990.00      high
170:   0126 Leo King 1010.50      high
171:   8159      Tut  750.75      high

Diagnose distinct values and duplicates

dogs[, .(
  rows = .N,
  distinct_ids = uniqueN(dog_id),
  distinct_breeds = uniqueN(breed)
)]

    rows distinct_ids distinct_breeds
   <int>        <int>           <int>
1:   171          170              21

dogs[duplicated(dogs, by = "dog_id"), .(dog_id, name)]

   dog_id   name
   <char> <char>
1:   4200    Max

dim(unique(dogs, by = "dog_id"))

[1] 170  10

uniqueN() counts distinct values; unique() returns them
uniqueN(key) == nrow(dt) is a one-line uniqueness test
duplicated(dt, by = ...) lists offending rows
unique(dt, by = ...) keeps one row per key

Piping with `data.table`

data.table supports the native pipe |>
Use _ as a placeholder for the output of the previous step

dogs |>
  _[age > 0] |>
  _[, price_per_yr := round(price / age)] |>
  _[order(-price_per_yr)] |>
  _[, .(dog_id, name, price_per_yr)]

     dog_id     name price_per_yr
     <char>   <char>        <num>
  1:   6447 Hercules         1000
  2:   6941   Zephyr          950
  3:   9866   Cherry          790
 ---                             
160:   6197    Benji           49
161:   6596    Oscar           46
162:   1976    Honey           41

Grouping and Summarizing

Importing Data
Core data.table Grammar
Useful data.table Tooling
Grouping and Summarizing
Missing Values And Type Repair
Writing Clean Data
Using Functions when Wrangling

Grouping and summarizing with `by=`

i chooses rows
j says what to return or compute
by says which groups should get separate answers

Grouping and summarizing with `by=` (cont.)

Evaluate j separately for each group:

dogs[, mean(price), by = .(insured)]

   insured       V1
     <int>    <num>
1:       0 651.2295
2:       1 735.3387
3:      NA 810.1875

Can group by multiple variables, and by conditions:

dogs[, mean(price), 
     by = .(insured, 
            white = color == "White")]

   insured  white       V1
     <int> <lgcl>    <num>
1:       0  FALSE 651.4461
2:       1  FALSE 735.5226
3:       1   TRUE 733.7500
4:       0   TRUE 650.1250
5:      NA  FALSE 883.3333
6:      NA   TRUE 590.7500

Grouping and summarizing with `by=` (cont.)

Find cheapest dog by breed, ordered by how frequent the breed is

dogs[, .(.N, min(price)), by = "breed"] |>
  _[order(-N)]

                       breed     N     V2
                      <char> <int>  <num>
 1:                      Pug    16 600.00
 2:                  Whippet    15 560.00
 3:            Border Collie    13 510.00
---                                      
19:          Irish Wolfhound     4 840.25
20:                  Bulldog     3 700.00
21: Chesapeake Bay Retriever     2 720.00

Grouped counts are useful diagnostics

dogs[, .(
  rows = .N,
  missing_sale_date = sum(is.na(sale_date)),
  insured_share = round(mean(insured == 1, na.rm = TRUE), 2)
), by = breed][order(-rows)]

                       breed  rows missing_sale_date insured_share
                      <char> <int>             <int>         <num>
 1:                      Pug    16                 3          0.73
 2:                  Whippet    15                 4          0.71
 3:            Border Collie    13                 2          0.31
---                                                               
19:          Irish Wolfhound     4                 2          1.00
20:                  Bulldog     3                 1          1.00
21: Chesapeake Bay Retriever     2                 0          1.00

Transforming with `:=` and `by=`

Saving the average price by age to a new column, repeating the summary for each row in group:

dogs[, avg_price_by_age := mean(price),
     by = .(age)]

dogs[order(age),
     .(age, avg_price_by_age)]

       age avg_price_by_age
     <int>            <num>
  1:    -5        1120.0000
  2:    -4         820.0000
  3:    -3         740.5000
 ---                       
169:    10         545.3125
170:    10         545.3125
171:    11         540.0000

Grouped transformations create new columns

copy(dogs) |>
  _[, price_gap_from_breed_mean :=
    price - mean(price, na.rm = TRUE),
    by = breed] |>
  _[breed %in% c("Pug", "Boxer"),
    .(dog_id, name, breed, price, price_gap_from_breed_mean)]

Key: <dog_id, breed>
    dog_id     name  breed  price price_gap_from_breed_mean
    <char>   <char> <char>  <num>                     <num>
 1:   1927   Oliver    Pug 750.00                 -95.82812
 2:   2370     Juno    Pug 910.00                  64.17188
 3:   2646 Snowball    Pug 620.50                -225.32812
---                                                        
21:   8522    Rocky  Boxer 500.00                -177.30714
22:   9883    Simba    Pug 930.00                  84.17188
23:   9988    Ziggy  Boxer 610.25                 -67.05714

`.SD`

Inside a data.table, .SD refers to the Subset of Data for each group.

We can use this to run more complex operations, like

dogs[, lapply(.SD, func), by = ...]

applying a custom function func to each .SD group.

`.SDcols`

Specifies the columns included in .SD. Can be used to select columns in smart ways.

dogs[, c(
  list(dogs = .N),
  lapply(.SD, function(x) round(mean(x, na.rm = TRUE), 2))
), by = breed, .SDcols = c("age", "price")]

                 breed  dogs   age  price
                <char> <int> <num>  <num>
 1:      Border Collie    13  5.00 678.60
 2: Labrador Retriever     6  3.00 708.71
 3:            Whippet    15  3.87 739.70
---                                      
19:          Dalmatian    12  5.00 658.62
20:       Irish Setter    10  4.90 676.28
21:             Beagle     7  5.14 536.00

Use `shift()` to create a lagged vector

shift() returns a lagged vector within groups

events

    dog_id          event_time event_type           shelter_name staff_id event_date
    <char>              <POSc>     <char>                 <char>   <char>     <IDat>
 1:   0013 2023-01-15 08:42:00     intake     Happy Paws Shelter      A12 2023-01-15
 2:   0013 2023-03-20 23:40:00    adopted     Happy Paws Shelter      B07 2023-03-20
 3:   3382 2023-02-10 09:05:00     intake       City Animal Care      A12 2023-02-10
---                                                                                 
81:   1254 2023-09-16 13:25:00  vet_check     Happy Paws Shelter      V01 2023-09-16
82:   4094 2023-09-04 14:50:00  vet_check     Happy Paws Shelter      V01 2023-09-04
83:   5522 2024-01-08 10:20:00  vet_check Willow Creek Adoptions      V03 2024-01-08

setorder(events, dog_id, event_time) # setorder() is crucial!
events[,
  hours_since_previous_event := round(
    as.numeric(difftime(event_time, shift(event_time), units = "hours")), 1
  ),
  by = dog_id
]
events[
  dog_id %in% c("0013", "3382"),
  .(dog_id, event_time, event_type, hours_since_previous_event)
]

   dog_id          event_time event_type hours_since_previous_event
   <char>              <POSc>     <char>                      <num>
1:   0013 2023-01-15 08:42:00     intake                         NA
2:   0013 2023-03-20 23:40:00    adopted                     1551.0
3:   3382 2023-02-10 09:05:00     intake                         NA
4:   3382 2023-02-13 14:30:00  vet_check                       77.4
5:   3382 2023-04-01 11:30:00    adopted                     1124.0

Missing Values And Type Repair

Importing Data
Core data.table Grammar
Useful data.table Tooling
Grouping and Summarizing
Missing Values And Type Repair
Writing Clean Data
Using Functions when Wrangling

Missing values are a workflow issue

Some missings are genuine unknowns
Some are fake codes like -99
Some are actually impossible values that should become NA

Code fake missings at read time when you can

dogs_raw <- fread(dog_path, skip = 3, encoding = "Latin-1")
dogs <- fread(
  dog_path,
  skip = 3,
  encoding = "Latin-1",
  check.names = TRUE,
  na.strings = c("", "-99"),
  colClasses = list(character = "dog_id")
)

unique(dogs_raw$insured)

[1]   1   0  NA -99

unique(dogs$insured)

[1]  1  0 NA

Use `is.na()` explicitly

dogs[is.na(insured), .(dog_id, name, insured)]

   dog_id    name insured
   <char>  <char>   <int>
1:   6015 Patches      NA
2:   6941  Zephyr      NA
3:   4890  Anubis      NA
4:   8857   Snowy      NA

Some values are not missing, just wrong

dogs[age < 0, .(dog_id, name, age)]

   dog_id    name   age
   <char>  <char> <int>
1:   6455    Coco    -1
2:   6859  Härley    -3
3:   4967    Növa    -2
4:   6301 Düke II    -4
5:   5131     Rex    -2
6:   7690  Wolfie    -5

Repair obvious problems explicitly

dogs_fixed <- copy(dogs)
dogs_fixed[age < 0, age := NA_integer_]

dogs_fixed[, .(
  missing_age = sum(is.na(age)),
  missing_sale_date = sum(is.na(sale_date)),
  missing_insured = sum(is.na(insured))
)]

   missing_age missing_sale_date missing_insured
         <int>             <int>           <int>
1:           6                37               4

`fcoalesce()` fills missings with a default

dogs_fixed[, owner_label := fcoalesce(new.owner, "not yet sold")]

dogs_fixed[
  is.na(new.owner),
  .(dog_id, name, new.owner, owner_label)
][1:5]

   dog_id    name new.owner  owner_label
   <char>  <char>    <char>       <char>
1:   6988    Bear      <NA> not yet sold
2:   8380    Ruby      <NA> not yet sold
3:   8774    Duke      <NA> not yet sold
4:   1288   Penny      <NA> not yet sold
5:   8853 Winston      <NA> not yet sold

Returns the first non-NA value, argument by argument
Replaces nested ifelse(is.na(x), y, x) patterns
Accepts more than two arguments: fcoalesce(a, b, c)

Date parsing is a type problem too

class(dogs$intake_date)

[1] "IDate" "Date"

class(dogs$sale_date)

[1] "character"

intake_date was parsed as IDate (data.table’s date type, inherits from Date)
sale_date is still text — the 03/20-2023 format wasn’t recognized

Parse the dates explicitly

dogs_fixed[, sale_date := as.IDate(sale_date, format = "%m/%d-%Y")]

class(dogs_fixed$sale_date)

[1] "IDate" "Date"

dogs_fixed[1:3, .(dog_id, intake_date, sale_date)]

   dog_id intake_date  sale_date
   <char>      <IDat>     <IDat>
1:   0013  2023-01-15 2023-03-20
2:   3382  2023-02-10 2023-04-01
3:   4200  2023-03-05 2023-06-15

as.IDate() parses text into IDate, given the format string
%m/%d-%Y matches 03/20-2023: month, /, day, -, four-digit year

Writing Clean Data

Importing Data
Core data.table Grammar
Useful data.table Tooling
Grouping and Summarizing
Missing Values And Type Repair
Writing Clean Data
Using Functions when Wrangling

Separate raw, processed, and analysis steps

Raw files are inputs
Processed files are reproducible intermediate outputs
Analysis starts from processed data, not from manual edits

Write a cleaned file with `fwrite()`

out_dir <- file.path(tempdir(), "ec7422-wrangling")
dir.create(out_dir, showWarnings = FALSE)

out_path <- file.path(out_dir, "dogs_clean.csv")
fwrite(dogs_fixed, out_path)

Read it back and verify

dogs_roundtrip <- fread(
  out_path,
  colClasses = list(character = "dog_id")
)
dogs_roundtrip[1:3, .(dog_id, name, age, sale_date)]

   dog_id   name   age  sale_date
   <char> <char> <int>     <IDat>
1:   0013  Buddy     5 2023-03-20
2:   3382   Lucy     3 2023-04-01
3:   4200    Max     7 2023-06-15

Using Functions when Wrangling

Importing Data
Core data.table Grammar
Useful data.table Tooling
Grouping and Summarizing
Missing Values And Type Repair
Writing Clean Data
Using Functions when Wrangling

Repetition is the best reason to write a function

repeated import rules
repeated validation checks
repeated cleaning with one file path changed
repeated summaries that should agree exactly

A wrangling function should show its contract

explicit arguments
one clear output
no hidden dependence on objects in the global environment
one place to store the weird file-specific rules

Put the dog import rules in one place

read_dogs <- function(path) {
  fread(
    path,
    skip = 3,
    encoding = "Latin-1",
    check.names = TRUE,
    na.strings = c("", "-99"),
    colClasses = list(character = "dog_id")
  )
}

read_dogs(dog_path)[
  1:3,
  .(dog_id, name, intake_date, sale_date)
]

   dog_id   name intake_date  sale_date
   <char> <char>      <IDat>     <char>
1:   0013  Buddy  2023-01-15 03/20-2023
2:   3382   Lucy  2023-02-10 04/01-2023
3:   4200    Max  2023-03-05 06/15-2023

Hidden globals make even simple functions misleading

bad_oldest <- function(dt) dogs[which.max(age), .(dog_id, name, age)]
oldest_dog <- function(dt) dt[which.max(age), .(dog_id, name, age)]

puppies <- dogs[age <= 2]

bad_oldest(puppies)

   dog_id   name   age
   <char> <char> <int>
1:   6197  Benji    11

oldest_dog(puppies)

   dog_id   name   age
   <char> <char> <int>
1:   1893   Lúna     2

Returning a table keeps the effect visible

add_senior_flag <- function(dt, age_cutoff = 7L) {
  out <- copy(dt)
  out[, senior := age >= age_cutoff]
  out
}

add_senior_flag(dogs)[, .(dog_id, age, senior)]

     dog_id   age senior
     <char> <int> <lgcl>
  1:   0013     5  FALSE
  2:   3382     3  FALSE
  3:   4200     7   TRUE
 ---                    
169:   8857     2  FALSE
170:   0126     5  FALSE
171:   8159     0  FALSE

Default arguments keep small helpers flexible

add_senior_flag(dogs)[, .N, by = senior]

   senior     N
   <lgcl> <int>
1:  FALSE   131
2:   TRUE    40

add_senior_flag(dogs, age_cutoff = 9L)[, .N, by = senior]

   senior     N
   <lgcl> <int>
1:  FALSE   159
2:   TRUE    12

Main takeaways

A successful import is not the same as a correct import
Inspect names, types, keys, and missingness immediately
Read DT[i, j, by] as “i=rows, j=columns/calculations, by=groups”
Use := deliberately and copy() when you need separation

Lecture 6: Data Wrangling I

Today

Importing: Raw data is not always easy to read

Wrangling: Make data usable for analysis

Importing Data

Repeating the import checklist

data.table::fread() for text files

Let’s import data from an AI-generated kennel

View your raw data

fread() is forgiving, but can’t solve all issues

Encoding problems show up in the values

Tell fread what you know

Repairing names

Protect identifiers at import time

A good import call is explicit

select the columns you use

Importing Stata .dta files with haven

Reading excel files with readxl

Core data.table Grammar

data.table and dplyr

data.table: Core concept

Read DT[i, j, by] as a sentence

i filters rows

i can use membership too

%between% filters ranges

j selects columns

j can compute new outputs

:= modifies by reference

Use copy() when you need separation

Comparison to dplyr

data.table vs. dplyr - key differences

Useful data.table Tooling

Setting a key creates an index for faster operations

Removing columns by reference with := NULL

Renaming and reordering by reference

fcase() is cleaner than nested ifelse()

Diagnose distinct values and duplicates

Piping with data.table

Grouping and Summarizing

Grouping and summarizing with by=

Grouping and summarizing with by= (cont.)

Grouping and summarizing with by= (cont.)

Grouped counts are useful diagnostics

Transforming with := and by=

Grouped transformations create new columns

.SD

.SDcols

Use shift() to create a lagged vector

Missing Values And Type Repair

Missing values are a workflow issue

Code fake missings at read time when you can

Use is.na() explicitly

Some values are not missing, just wrong

Repair obvious problems explicitly

fcoalesce() fills missings with a default

Date parsing is a type problem too

Parse the dates explicitly

Writing Clean Data

Separate raw, processed, and analysis steps

Write a cleaned file with fwrite()

Read it back and verify

Using Functions when Wrangling

Repetition is the best reason to write a function

A wrangling function should show its contract

Put the dog import rules in one place

Hidden globals make even simple functions misleading

Returning a table keeps the effect visible

Default arguments keep small helpers flexible

Main takeaways

Next lecture: advanced wrangling

`data.table::fread()` for text files

`fread()` is forgiving, but can’t solve all issues

Tell `fread` what you know

`select` the columns you use

Importing Stata `.dta` files with `haven`

Reading excel files with `readxl`

`data.table` and `dplyr`

`data.table`: Core concept

Read `DT[i, j, by]` as a sentence

`i` filters rows

`i` can use membership too

`%between%` filters ranges

`j` selects columns

`j` can compute new outputs

`:=` modifies by reference

Use `copy()` when you need separation

Comparison to `dplyr`

`data.table` vs. `dplyr` - key differences

Removing columns by reference with `:= NULL`

`fcase()` is cleaner than nested `ifelse()`

Piping with `data.table`

Grouping and summarizing with `by=`

Grouping and summarizing with `by=` (cont.)

Grouping and summarizing with `by=` (cont.)

Transforming with `:=` and `by=`

`.SD`

`.SDcols`

Use `shift()` to create a lagged vector

Use `is.na()` explicitly

`fcoalesce()` fills missings with a default

Write a cleaned file with `fwrite()`