data/processed/: cleaned or reshaped data created by your code
src/: code/scripts that do the work
output/: things your code produces for humans to inspect or submit
README.md: what the project is and how to run it
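A minimal sketch of creating this layout from the shell. The project name my-project is a placeholder, and only the folders listed above are created.

```shell
# Hypothetical project name; pick your own.
project="my-project"

# Create the folders described above (-p also creates missing parents).
mkdir -p "$project/data/processed" "$project/src" "$project/output"

# Start an empty README to fill in later.
touch "$project/README.md"

# List everything that was created.
find "$project" | sort
```

Running it a second time is harmless: mkdir -p and touch leave existing folders and files in place.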
src/ is for your code
src/
├── 00_run_pipeline.R | runs the project in the intended order
├── 01_clean_data.R | cleans and standardizes inputs
├── 02_merge_data.R | joins tables and checks keys
├── 03_make_figures.R | produces human-facing output
└── utils.R | reusable helper functions, if you really need them
Put code in src/, not in random top-level files
Numbers are fine for ordering, but never let them replace descriptive names
A good test: could another person (or agent) guess where to look for a specific step?
Split code by job, not by mood
Better: one or more scripts each for cleaning, merging, and figures
Prefer ignoring folders or generated artifacts you can recreate
Be careful with broad patterns like *.csv
The point is to track source and logic, not every disposable by-product
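One way to check what an ignore rule actually catches, sketched in a throwaway repository. The file names and the data/raw folder are hypothetical; git check-ignore is a standard Git command.

```shell
# Work in a throwaway repository so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Ignore whole folders of recreatable artifacts instead of every CSV.
printf '%s\n' 'data/processed/' 'output/' > .gitignore

# Hypothetical files: one generated, one hand-made lookup table.
mkdir -p data/processed data/raw
touch data/processed/clean.csv data/raw/lookup.csv

# For an ignored path, -v prints the rule that matched it.
git check-ignore -v data/processed/clean.csv

# For a path that is not ignored, check-ignore exits non-zero.
git check-ignore -q data/raw/lookup.csv || echo "data/raw/lookup.csv is not ignored"
```

A broad pattern such as *.csv would have hidden the hand-made lookup table as well, which is the risk the slide warns about.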
Short shell intro
Project Hygiene
Short shell intro
Version Control in Git
Git Branches
Git Merge conflicts
Git Failure Modes
Extras
The Unix shell
The Unix shell (cont.)
Hackers in movies are often portrayed working in terminals.
In reality, developers use shell commands because they offer efficient ways to interact with computers and files. We will work in a shell originating in the Unix family of operating systems.
The Unix philosophy is that tools should be “minimalist and modular”:
Do One Thing And Do It Well
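A small sketch of what this looks like in practice: three single-purpose tools chained with pipes. The file names.txt and its contents are made up for illustration.

```shell
# Hypothetical input: one municipality name per line, with repeats.
printf '%s\n' Oslo Bergen Oslo Tromso Bergen Oslo > names.txt

# sort groups identical lines, uniq -c counts each group,
# and the second sort ranks the counts from largest to smallest.
sort names.txt | uniq -c | sort -rn
```

None of the three tools knows about the others; the pipe character passes the output of one as the input of the next.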
Terminology
Terminal, command prompt, tty: wrapper that runs the shell
Shell, Bash, zsh: programs that run commands and return output
Follow the instructions in Problem set 0 to set up your Git user.name and user.email. Otherwise you will get an error when pressing “Commit”.
Good commits are small and legible
Bad: update, changes, stuff
Better: Add README setup section
Better: Clean municipality names before merge
Commit early and often
Commit when one small idea is done
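The same habit from the command line, sketched in a throwaway repository. The user name, email, and README contents are placeholders; in a real project the identity comes from your own Git configuration.

```shell
# Throwaway repository for the demonstration.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Placeholder identity, set for this repository only.
git config user.name "Example Student"
git config user.email "student@example.com"

# One small idea: add a setup section to the README.
printf '%s\n' '# Project' '' '## Setup' 'Run: Rscript src/00_run_pipeline.R' > README.md

# Stage only the file that belongs to this idea, then commit with a legible message.
git add README.md
git commit -q -m "Add README setup section"

# One line per commit; the message should make sense on its own.
git log --oneline
```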
Git is not backup
Git tracks changes
It does not protect you from deleting the wrong folder
It does not make giant raw files pleasant to manage
Keep a real backup too (Box/Dropbox, external drive, etc.)
Git Branches
Branches
Want to test a large change, but unsure it will work?
Create a new branch to try it out, then just switch back to main if it fails.
Keep track of your changes (with commits) without disturbing your collaborators with unfinished code.
If you are happy with your changes, merge them into the main branch.
Branching is awesome, use it!
Creating a new branch in VS Code
git switch --create test-branch
Merging branches locally
git switch main
git merge test-branch
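The whole branch workflow end to end, as a sketch in a throwaway repository. The identity and file contents are placeholders; the Git commands are the ones shown above plus ordinary add and commit.

```shell
# Throwaway repository with a first commit on main.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # make sure the first branch is called main
git config user.name "Example Student"          # placeholder identity
git config user.email "student@example.com"
echo "# test-repo" > README.md
git add README.md
git commit -q -m "Add README"

# Try a change on its own branch.
git switch -q --create test-branch
echo "A repo for playing around." >> README.md
git commit -q -am "Describe the repo in README"

# Happy with it: go back to main and merge the branch in.
git switch -q main
git merge -q test-branch

git log --oneline
```

If the experiment had failed, switching back to main without merging would leave main exactly as it was.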
Pull requests (GitHub feature)
A pull request is a proposal to merge one branch into another
Useful in team projects (code review), overkill for solo work
Pull request flow
Git Merge conflicts
Merge conflicts
A conflict happens when Git sees competing edits
Git stops because it does not want to guess
Annoying, yes
Still better than silent overwriting
Merge conflicts in VS Code
Resolving conflicts using the merge editor
Example: Code change by multiple sources
Two contributors have changed the same file. Git stops the merge rather than overwrite either change, so the conflict must be resolved manually.
Manual edits
VS Code provides a great UI for resolving merge conflicts, but you can also do it manually by editing the conflicting files. If you open the file you will see something like this. The strange characters are Git’s way of highlighting the merge conflict.
# test-repo
<<<<<<< HEAD
A repo for testing and having fun.
=======
A repo for playing around.
>>>>>>> 814e09178910383c128045ce67a58c9c1df3f558.
A not so cool change to the README file.
You fix the conflict manually by deleting the marker lines (<<<<<<<, =======, >>>>>>>) and keeping the text you prefer.
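To see a conflict safely, it can be reproduced in a throwaway repository. This is a sketch: the identity is a placeholder, and the two README sentences are borrowed from the example above.

```shell
# Throwaway repository with a shared starting point.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Student"          # placeholder identity
git config user.email "student@example.com"
printf '%s\n' '# test-repo' 'A repo.' > README.md
git add README.md
git commit -q -m "Add README"

# One edit to line 2 on a branch...
git switch -q --create test-branch
printf '%s\n' '# test-repo' 'A repo for playing around.' > README.md
git commit -q -am "Describe repo as a playground"

# ...and a competing edit to the same line on main.
git switch -q main
printf '%s\n' '# test-repo' 'A repo for testing and having fun.' > README.md
git commit -q -am "Describe repo as a test bed"

# Git cannot pick a winner, so the merge stops with a non-zero exit code.
git merge test-branch || echo "Merge stopped: conflict in README.md"

# The file now contains the conflict markers.
cat README.md

# Resolve by writing the version you want (here: main's wording), then commit.
printf '%s\n' '# test-repo' 'A repo for testing and having fun.' > README.md
git add README.md
git commit -q -m "Resolve README conflict"
```

Staging the file with git add is how you tell Git the conflict in that file is resolved; the final commit completes the merge.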
Useful later: SSH, rebasing, tags, more careful collaboration workflows
Git Failure Modes
Boring failure modes
Opened the wrong folder
Edited a file outside the repo
Forgot to save before committing
Tried to commit generated junk
Saw an error and started guessing
First recovery habits
Read the error message
Check what folder you are in
Run git status
Look at the file tree
Change one thing at a time
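These habits map onto a handful of read-only commands. The sketch below bundles them into a helper; the function name where_am_i is made up, while the commands inside are standard.

```shell
# Hypothetical helper: prints where you are and what state the repo is in.
# Nothing here changes any files.
where_am_i() {
  echo "Folder:    $(pwd)"
  echo "Repo root: $(git rev-parse --show-toplevel 2>/dev/null || echo 'not inside a Git repo')"
  # Branch name plus a short list of changed files (silent outside a repo).
  git status --short --branch 2>/dev/null
  # File tree at this level, including hidden entries such as .git and .gitignore.
  ls -a
}

where_am_i
```

If the repo root is not the folder you expected, that usually explains the "opened the wrong folder" and "edited a file outside the repo" failures.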
If Git says “nothing to commit”
Maybe you changed nothing
Maybe you changed a file outside the repo
Maybe the file is ignored by .gitignore
Maybe you forgot to save
If push is rejected
Usually authentication failed, or the remote moved ahead (it has commits you have not pulled yet)
Read the message before clicking random buttons
Do not delete the .git folder
Do not start over unless you understand why
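A sketch of the rejected-push case, assuming the cause is that the remote has commits you have not pulled. A local bare repository stands in for GitHub, and the collaborators alice and bob are hypothetical.

```shell
work="$(mktemp -d)"
cd "$work"
git init -q --bare remote.git          # stands in for GitHub

# Two hypothetical collaborators clone the same remote.
for person in alice bob; do
  git clone -q remote.git "$person" 2>/dev/null
  git -C "$person" symbolic-ref HEAD refs/heads/main
  git -C "$person" config user.name "$person"
  git -C "$person" config user.email "$person@example.com"
done

# alice pushes the first commit; bob pulls it.
cd "$work/alice"
echo "# test-repo" > README.md
git add README.md
git commit -q -m "Add README"
git push -q origin main
git -C "$work/bob" pull -q origin main

# alice pushes again, so the remote moves ahead of bob.
echo "Notes from alice." > notes.txt
git add notes.txt
git commit -q -m "Add notes file"
git push -q origin main

# bob commits without pulling first; his push is rejected.
cd "$work/bob"
echo "A repo for playing around." >> README.md
git commit -q -am "Describe the repo in README"
git push -q origin main || echo "Push rejected: bring in the remote changes first"

# Bring in the remote commits with a merge, then push again.
git pull -q --no-rebase --no-edit origin main
git push -q origin main
git log --oneline --graph
```

Nothing was deleted and nothing was started over: reading the rejection message and pulling first was enough.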
If you get stuck
Stop making random changes
Figure out what state the project is in
Ask for help before you dig deeper
Most workflow problems are fixable if you stop early enough
What I want you to be able to do after today
Organize a project sanely
Use relative paths
Open a repo in VS Code
Make and explain a commit
Push work without panic
Understand what branches, pull requests, and merge conflicts are for
Git advice
Commit often, work in small features
Don’t: “New data processing script”
Do: “Removed duplicates from survey data”
Push changes when you want your collaborators to see them
Git is not backup!
Branches are useful even when solo
When all else fails… 🤷
Next Time: R Basics
Running code
Objects and vectors
Missing values
Functions
Reading simple errors without immediately spiraling
Extras
SSH = Secure Shell
Logs in to a remote server over an encrypted channel
Really useful but setup requires some work
Uses public key cryptography
Private key - stored on your computer
Public key - stored on the server
When connecting, your computer uses the private key to sign a message, and the server checks the signature with the corresponding public key
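To see the two halves of a key pair without touching your real keys, here is a sketch that writes a throwaway pair to a temporary folder. The empty passphrase (-N "") and the file name demo_key are for this demonstration only; real keys should have a passphrase.

```shell
# Temporary folder so your real ~/.ssh is not touched.
keydir="$(mktemp -d)"

# Demo only: -N "" sets an empty passphrase, -f picks the output file.
ssh-keygen -q -t ed25519 -C "your_email@example.com" -N "" -f "$keydir/demo_key"

# Two files appear: demo_key (private, stays on your computer)
# and demo_key.pub (public, the part you give to a server).
ls "$keydir"

# The public key is a single line of text that is safe to share.
cat "$keydir/demo_key.pub"

# Its fingerprint is a short identifier for the key.
ssh-keygen -l -f "$keydir/demo_key.pub"
```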
SSH and GitHub
SSH is another way to authenticate with GitHub
HTTPS is fine for beginners
SSH becomes convenient if you use Git a lot
Generating a key pair
To generate an SSH key pair and register the public key with GitHub, follow these steps. For detailed instructions, see GitHub Docs.
ssh-keygen -t ed25519 -C "your_email@example.com"
Press Enter to accept the default file location. When prompted, enter a secure passphrase (your terminal will not show characters as you type).
Storing the passphrase using ssh-agent
You should secure your keys with a passphrase. But typing the passphrase each time you use the SSH key can become annoying. Instead, you can let ssh-agent hold the unlocked key for your session. On Mac/Linux this is easy:
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
On Windows, it’s more complicated. I’ve included the instructions in Problem Set 0 but let’s go through them together now.
Connecting to GitHub over SSH
To use SSH with GitHub you first need to add your public key to your GitHub profile. Go to settings and “SSH and GPG keys” to add it.
Afterwards, run:
ssh -T git@github.com
Now you can clone repositories using SSH rather than HTTPS.
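A repository you already cloned over HTTPS can be pointed at the SSH address instead of being cloned again. A sketch in a throwaway repository; USER and REPO are placeholders, not a real repository.

```shell
# Throwaway repository so no real remote is changed.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Start with an HTTPS remote (placeholder owner and repository name).
git remote add origin https://github.com/USER/REPO.git

# Point the same remote at the SSH address instead.
git remote set-url origin git@github.com:USER/REPO.git

# Confirm which address fetch and push will use.
git remote -v
```

For a fresh clone, the SSH address goes straight into git clone in place of the HTTPS one.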
We will get back to how to use SSH to connect to servers later in the course.