# Data Analysis and Visualisation practicals

Here you can find all information and files for the practicals of the elective master’s course Data Analysis and Visualisation at Utrecht University (course code `201600038` in Osiris).

You will work inside the practicals folder. Download the folder and unzip it to a sensible location on your computer.

| # | Name | HTML | PDF | Answers |
|----|------|------|-----|---------|
| 01 | R basics for DAV | .html | .pdf | |
| 02 | Data manipulation & EDA | .html | .pdf | Answers |
| 03 | Data Visualisation using ggplot2 | .html | .pdf | Answers |
| 04 | Assignment EDA | .html | .pdf | |
| 05 | Supervised learning: Regression 1 | .html | .pdf | Answers |
| 06 | Supervised learning: Regression 2 | .html | .pdf | Answers |
| 07 | Supervised learning: Regression 3 | .html | .pdf | Answers |
| 08 | Supervised learning: Classification 1 | .html | .pdf | Answers |
| 09 | Supervised learning: Classification 2 | .html | .pdf | Answers |
| 10 | Assignment Prediction Model | .html | .pdf | |
| 11 | Unsupervised learning: PCA & Correspondence Analysis | .html | .pdf | Answers |
| 12 | Unsupervised learning: Clustering | .html | .pdf | Answers |

## Prerequisites

• Install `R` and RStudio Desktop (open source) by following the instructions here
• If you don’t yet have a TeX distribution, install one:

If you have no experience with `R` or another programming language, you will need to catch up before and during the course. This is not an introductory course on programming with `R`, but a course on data analysis and visualisation.

Some good sources are:

The following is the minimum you should know about `R` before starting the first practical:

• What is `R` (a fancy calculator) and what is an `.R` file (a recipe for calculations)
• What is an `R` package (a set of functions you can download to use in your own code)
• How to run `R` code in `RStudio`
• What is a variable `x <- 10`
• What is a function `y <- fun(x = 10)`
• Understand what the following statements do (tip: you may run them in `R` line by line)
```r
y <- "Let him go!"
x <- "Bismillah!"
z <- paste(x, "No, we will not let you go.", y)
rep(z, 3)
1:10
sample(1:20, 4)
sample(1:20, 40, replace = TRUE)
z <- c(1, 2, 3, 4, 5, 4, 3, 2, 1)
z^2
z == 2
z > 2
install.packages("dplyr")
library(dplyr)
```
• Be able to read the help file of any function (e.g., type `?plot` in the console)

## Outline of the practicals

Anything written in italic font is optional/extra material. You can look these up yourself if you have extra time.

### Week 1

• R basics for DAV
  • `R` and `RStudio`
  • Project organisation
  • Help files using `?`, CRAN, and internet search
  • `R Markdown`
  • The `ISLR` package (datasets from James ISLR)
  • The `tidyverse` as a dialect of the `R` language (Wickham R4DS)
  • The Google style guide (ISLR does not follow it)
  • R packages on GitHub
• Data manipulation & exploratory data analysis
  • Data types: `character`, `numeric`, `factor`
  • Lists
  • Loading datasets from `.csv` or `.xlsx` (or other formats with `haven`)
  • `data.frame()` and `tibble()`
  • `View()`, `head()`, `tail()`
  • `summary()`
  • `filter()`, `select()`, and `mutate()` from `dplyr`
  • `bind_rows()`, `bind_cols()`
  • missing values (`na.omit`)
  • `group_by()` and `summarise()` from `dplyr`
  • the pipe operator `%>%`
  • `table()`
  • dplyr cheatsheet
  • wide to long format: `gather` and `spread`
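
As a rough sketch of how the `dplyr` verbs listed above chain together with the pipe, the example below uses the built-in `mtcars` dataset. The dataset, the `hp > 100` threshold, and the `wt_kg` column are assumptions chosen for illustration; they are not taken from the practical itself.

```r
library(dplyr)

# mtcars ships with R; it is used here only as an illustrative dataset
cars_summary <- mtcars %>%
  filter(hp > 100) %>%                 # keep rows: cars with more than 100 hp
  select(cyl, mpg, wt) %>%             # keep columns we need
  mutate(wt_kg = wt * 453.592) %>%     # wt is recorded in units of 1000 lbs
  group_by(cyl) %>%                    # one group per number of cylinders
  summarise(
    n          = n(),                  # number of cars in the group
    mean_mpg   = mean(mpg),
    mean_wt_kg = mean(wt_kg)
  )

cars_summary
```

Reading the pipeline top to bottom gives the order of operations: each verb receives the result of the previous line as its first argument.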

### Week 2

• Data Visualisation using ggplot2
  • Preparing data for a `ggplot()` call
  • What is a `ggplot` object and how to construct it
  • Aesthetics: `x`, `y`, `size`, `colour`, `fill`
  • `geom_point()`, `geom_line()`, `geom_bar()`
  • Labels, limits
  • `geom_boxplot()`, `geom_density()`
  • themes (`ggthemes`?)
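
To make the idea of a `ggplot` object concrete, here is a minimal sketch that builds one plot layer by layer. The `mtcars` data, the axis labels, and the choice of `theme_minimal()` are assumptions for illustration only.

```r
library(ggplot2)

# Build the plot object step by step; nothing is drawn until it is printed
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +  # data + aesthetics
  geom_point(size = 2) +                                           # geometry layer
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders") +                                     # labels
  xlim(1, 6) +                                                     # axis limits
  theme_minimal()                                                  # theme

p  # printing the object draws the plot
```

Because `p` is an ordinary object, further layers can be added to it later with `+` without rebuilding the plot.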

### Week 3

• HANDIN: Pass / Fail assignment
  • Find a dataset and create an Exploratory Data Analysis
  • Tip: The new Google dataset search.
  • Format: stand-alone `RStudio` project folder with:
    • the dataset (`csv`, `xlsx`, `sav`, `dat`, `json`, or any other common format)
    • one `.Rmd` notebook file
    • a compiled `.pdf` or `.html`
  • Requirements:
    • explain the dataset in 1 or 2 paragraphs
    • use `tidyverse`
    • clean, legible `R` code (preferably following the Google style guide)
    • table(s) with relevant summary statistics
    • descriptive plots
    • explain what you did and why (max 3 paragraphs total)
• Supervised learning: Regression 1
  • `lm()`, the `formula` object, the `lm` object and its methods (`print()`, `summary()`, `coef()`, `plot()`)
  • Regression lines in `ggplot` with uncertainty
  • Linear regression with multiple variables, interaction effects
  • Model assessment:
    • Train/test split
    • Mean square error calculation (`predict()`)
    • AIC, BIC
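
The model assessment steps above can be sketched in a few lines of base `R`. The 70/30 split, the seed, and the formula `mpg ~ wt + hp` on the built-in `mtcars` data are assumptions made for this example.

```r
set.seed(123)  # make the random split reproducible

# Train/test split: roughly 70% of the rows go to the training set
n         <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]

# Fit a linear regression with two predictors on the training data only
fit <- lm(mpg ~ wt + hp, data = train)
summary(fit)

# Mean squared error on the held-out test data
pred     <- predict(fit, newdata = test)
mse_test <- mean((test$mpg - pred)^2)
mse_test

# Information criteria computed on the training fit
AIC(fit)
BIC(fit)
```

The test MSE estimates out-of-sample error directly, whereas AIC and BIC approximate it from the training fit by penalising the number of parameters.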

### Week 4

• Supervised learning: Regression 2
  • Feature selection
  • Regularization using the `glmnet` package
  • Optimising lambda
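
A minimal sketch of regularised regression with a cross-validated choice of lambda might look as follows. It assumes the `glmnet` package is installed; the lasso penalty (`alpha = 1`), the 5 folds, and the `mtcars` data are illustrative choices, not necessarily those used in the practical.

```r
library(glmnet)

# glmnet expects a numeric predictor matrix and a response vector
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # drop the intercept column
y <- mtcars$mpg

set.seed(123)  # fold assignment is random

# Cross-validate over a grid of lambda values; alpha = 1 gives the lasso
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 5)

cv_fit$lambda.min               # lambda with the lowest cross-validated error
coef(cv_fit, s = "lambda.min")  # coefficients at that lambda (zeros are dropped features)
plot(cv_fit)                    # cross-validated error against log(lambda)
```

Coefficients shrunk exactly to zero at the chosen lambda are effectively removed from the model, which is how the lasso doubles as a feature selection method.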

### Week 5

• Supervised learning: Regression 3
  • Polynomial regression
  • Nonlinear regression using the `splines` package
  • Visualising nonlinear regression
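
As an illustration of the topics above, the sketch below fits a quadratic polynomial and a natural cubic spline to the same data and draws both curves. The `mtcars` data, the polynomial degree, and the spline degrees of freedom are assumptions for this example.

```r
library(splines)

# Polynomial regression: quadratic in hp
fit_poly <- lm(mpg ~ poly(hp, 2), data = mtcars)

# Natural cubic spline with 3 degrees of freedom
fit_spline <- lm(mpg ~ ns(hp, df = 3), data = mtcars)

# Predict on an evenly spaced grid to draw smooth curves
grid <- data.frame(hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 100))
grid$poly   <- predict(fit_poly, newdata = grid)
grid$spline <- predict(fit_spline, newdata = grid)

# Visualise the data with both fitted curves
plot(mpg ~ hp, data = mtcars)
lines(grid$hp, grid$poly, lty = 1)    # solid line: polynomial
lines(grid$hp, grid$spline, lty = 2)  # dashed line: spline
```

Both models are still fitted with `lm()`; the nonlinearity comes entirely from the basis functions in the formula.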

### Week 6

• Supervised learning: Classification 1
  • (titanic data? default data?)
  • KNN
  • LDA
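
The two classifiers named above can be compared on a held-out set along these lines. The sketch assumes `knn()` from the `class` package and `lda()` from the `MASS` package; the `iris` data, `k = 5`, and the 100/50 split are illustrative assumptions rather than the data used in the practical.

```r
library(class)  # knn()
library(MASS)   # lda()

set.seed(123)
train_idx <- sample(nrow(iris), 100)
train     <- iris[train_idx, ]
test      <- iris[-train_idx, ]

# K-nearest neighbours: predictions come straight from the training data
knn_pred <- knn(train = train[, 1:4], test = test[, 1:4],
                cl = train$Species, k = 5)

# Linear discriminant analysis: fit a model first, then predict
lda_fit  <- lda(Species ~ ., data = train)
lda_pred <- predict(lda_fit, newdata = test)$class

# Proportion of correctly classified test observations
mean(knn_pred == test$Species)
mean(lda_pred == test$Species)
```

Note that loading `MASS` after `dplyr` masks `select()`; in that case call `dplyr::select()` explicitly.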

### Week 7

• Supervised learning: assessing classification methods
  • Confusion matrix, errors, AUC, ROC curve
  • Cross validation on classification problems
  • Classification trees
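
One possible way to combine these ideas is to cross-validate a classification tree and tabulate the out-of-fold predictions in a confusion matrix. The sketch assumes the `rpart` package; the `mtcars` data, the predictors, and the 4 folds are assumptions for illustration.

```r
library(rpart)

# Binary outcome: transmission type (0 = automatic, 1 = manual in mtcars)
dat <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

set.seed(123)
k    <- 4
fold <- sample(rep(1:k, length.out = nrow(dat)))  # random fold label per row

# Empty factor to collect the out-of-fold predictions
pred <- factor(rep(NA, nrow(dat)), levels = levels(dat$am))

for (i in 1:k) {
  # Fit the tree on all folds except fold i, then predict fold i
  fit <- rpart(am ~ wt + hp + mpg, data = dat[fold != i, ], method = "class")
  pred[fold == i] <- predict(fit, newdata = dat[fold == i, ], type = "class")
}

# Confusion matrix of cross-validated predictions against the truth
conf_mat <- table(predicted = pred, observed = dat$am)
conf_mat

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

Every observation is predicted exactly once by a model that never saw it, so the confusion matrix reflects out-of-sample performance.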

### Week 8

• HANDIN: Pass / Fail assignment
  • Find a dataset and create and assess a prediction model
  • Tip: The new Google dataset search.
  • Format: stand-alone `RStudio` project folder with:
    • the dataset (`csv`, `xlsx`, `sav`, `dat`, `json`, or any other common format)
    • one `.Rmd` notebook file
    • a compiled `.pdf` or `.html`
    • a `.Rproj` file
  • Requirements:
    • explain the dataset in 1 or 2 paragraphs
    • use `tidyverse`
    • clean, legible `R` code (preferably following the Google style guide)
    • explain which method you use
    • assess your predictions
    • use plots where useful (they are almost always useful)
• Unsupervised learning: PCA & Correspondence Analysis
  • PCA using `princomp`
  • Visualising PCA
  • SVD
  • Correspondence Analysis & Biplots
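
For the PCA part, a minimal sketch using only base `R` could look like this. The built-in `USArrests` data and the choice to analyse the correlation matrix (`cor = TRUE`) are assumptions for illustration.

```r
# PCA on the correlation matrix, so variables are put on a common scale
pca <- princomp(USArrests, cor = TRUE)
summary(pca)

# Proportion of variance explained by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
prop_var

# Scores of the first observations on the first two components
head(pca$scores[, 1:2])

# Biplot: observations and variable loadings in one display
biplot(pca)
```

Using the correlation matrix matters here because the variables in `USArrests` are measured on very different scales.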

### Week 9

• Unsupervised learning: Clustering
  • K-means clustering with `kmeans()`
  • Hierarchical clustering with `hclust()`
  • Visualising clusters in `ggplot`
  • Modularity clustering with `igraph`
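
As a small sketch of the first two methods, the example below clusters the same data with `kmeans()` and `hclust()` and cross-tabulates the results. The `iris` measurements, the choice of 3 clusters, and complete linkage are assumptions for illustration.

```r
# Standardise the four numeric columns so no variable dominates the distances
x <- scale(iris[, 1:4])

# K-means: random starts, so set a seed and use several restarts
set.seed(123)
km <- kmeans(x, centers = 3, nstart = 20)

# Hierarchical clustering on Euclidean distances, cut into 3 clusters
hc          <- hclust(dist(x), method = "complete")
hc_clusters <- cutree(hc, k = 3)

# Compare the two cluster assignments
table(kmeans = km$cluster, hclust = hc_clusters)
```

Cluster labels are arbitrary, so agreement between the two methods shows up as one dominant cell per row of the table rather than as matching numbers.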