# Data Analysis and Visualisation practicals

Here you can find all information and files for the practicals of the elective master’s course *Data Analysis and Visualisation* at Utrecht University (course code `201600038` in Osiris).

You will be working inside the practicals folder. Download the folder and unzip it to a sensible location on your computer.

## Links to practicals

| # | Name | HTML | Answers |
|---|---|---|---|
| 01 | R basics for DAV | .html | |
| 02 | Data manipulation & EDA | .html | Answers |
| 03 | Data Visualisation using ggplot2 | .html | Answers |
| 04 | Assignment EDA | .html | |
| 05 | Supervised learning: Regression 1 | .html | Answers |
| 06 | Supervised learning: Regression 2 | .html | Answers |
| 07 | Supervised learning: Regression 3 | .html | Answers |
| 08 | Supervised learning: Classification 1 | .html | Answers |
| 09 | Supervised learning: Classification 2 | .html | Answers |
| 10 | Assignment Prediction Model | .html | |
| 11 | Unsupervised learning: PCA & Correspondence Analysis | .html | Answers |
| 12 | Unsupervised learning: Clustering | .html | Answers |

## Prerequisites

- **Install** `R` and RStudio Desktop (open source) by following the instructions here
- **If you don’t yet have a TeX distribution, install one:**
  - MiKTeX on Windows
  - MacTeX on OS X
  - TeX Live or similar on Linux (you should be able to find one for your distro)

If you have no experience with `R` or another programming language, you are going to need to catch up before starting the course and during the course. *This is not an introductory course on programming with R, but a course on data analysis and visualisation*.

Some good sources are:

- The first two chapters of introduction to R on datacamp
- Install `R`, play around, and read the workflow basics chapter in Hadley Wickham’s R for Data Science
- Interactive R course: install `R` as in the previous point, type the following lines in the console one by one: `install.packages("swirl")`, `library(swirl)`, `swirl()`, and follow the guide to run the `R Programming: The basics of programming in R` interactive course.

**The following is the minimum of what you should know about R before starting with the first practical**

- What is `R` (a fancy calculator) and what is an `.R` file (a recipe for calculations)
- What is an `R` package (a set of functions you can download to use in your own code)
- How to run `R` code in `RStudio`
- What is a variable `x <- 10`
- What is a function `y <- fun(x = 10)`
- Understand what the following statements do (tip: you may run them in `R` line by line)
  - `y <- "Let him go!"`
  - `x <- "Bismillah!"`
  - `z <- paste(x, "No, we will not let you go.", y)`
  - `rep(z, 3)`
  - `1:10`
  - `sample(1:20, 4)`
  - `sample(1:20, 40, replace = TRUE)`
  - `z <- c(1, 2, 3, 4, 5, 4, 3, 2, 1)`
  - `z^2`
  - `z == 2`
  - `z > 2`
  - `install.packages("dplyr")`
  - `library(dplyr)`
- Be able to read the help file of any function (e.g., type `?plot` in the console)
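
As a minimal sketch of the variable and function items above: the function name `add_one` is hypothetical and chosen only for illustration, since `fun` in the list is a placeholder rather than an existing function.

```r
# Assign the value 10 to a variable called x
x <- 10

# Define a hypothetical function that adds one to its input
add_one <- function(x) {
  x + 1
}

# Call the function with a named argument and store the result in y
y <- add_one(x = 10)

# Comparisons on a vector are evaluated element by element
z <- c(1, 2, 3, 4, 5, 4, 3, 2, 1)
z > 2
```

Running the lines one by one in the RStudio console and inspecting `x`, `y`, and `z` in the Environment pane is a quick way to check your understanding.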

## Outline of the practicals

Anything written in *italic font* is optional/extra material. You can look those up by yourself if you have extra time.

### Week 1

**R basics for DAV**

- `R` and `RStudio`
- Project organisation
- Help files using `?`, CRAN, and internet search
- `R Markdown`
- The `ISLR` package (datasets from James ISLR)
- The `tidyverse` as a dialect of the `R` language (Wickham R4DS)
- The google style guide (ISLR does not follow it)
- *R packages on GitHub*

**Data manipulation & exploratory data analysis**

- Data types: `character`, `numeric`, `factor`
- Lists
- Loading datasets from `.csv` or `.xlsx` (or other formats with `haven`)
- `data.frame()` and `tibble()`
- `View()`, `head()`, `tail()`
- `summary()`
- `filter()`, `select()`, and `mutate()` from `dplyr`
- `bind_rows()`, `bind_cols()`
- missing values (`na.omit`)
- `group_by()` and `summarise()` from `dplyr`
- the pipe operator `%>%`
- `table()`
- *dplyr cheatsheet*
- *wide to long format:* `gather` and `spread`
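
The `dplyr` verbs listed for this practical can be chained with the pipe operator. The following is a sketch only, assuming the `dplyr` package is installed and using the built-in `mtcars` data rather than any dataset from the practicals; the column name `kpl` and the approximate conversion factor are illustrative choices.

```r
library(dplyr)

# mtcars ships with R; convert it to a tibble and keep the car names
cars <- as_tibble(mtcars, rownames = "model")

cars_summary <- cars %>%
  filter(hp > 100) %>%                  # keep cars with more than 100 horsepower
  select(model, mpg, cyl, wt) %>%       # keep only the columns we need
  mutate(kpl = mpg * 0.425) %>%         # approximate km per litre (illustrative)
  group_by(cyl) %>%                     # one group per number of cylinders
  summarise(n = n(), mean_kpl = mean(kpl))

cars_summary
```

Each step takes a data frame and returns a data frame, which is why the steps read from top to bottom as a recipe.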

### Week 2

**Data Visualisation using ggplot2**

- Preparing data for a `ggplot()` call
- What is a `ggplot` object and how to construct it
- Aesthetics: `x`, `y`, `size`, `colour`, `fill`
- `geom_point()`, `geom_line()`, `geom_bar()`
- Labels, limits
- `geom_boxplot()`, `geom_density()`
- *themes (`ggthemes`?)*
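
To illustrate how a `ggplot` object is built up from data, aesthetics, a geom, labels, and limits, here is a small sketch. It assumes the `ggplot2` package is installed and uses the built-in `mtcars` data; the variable choices are illustrative and not taken from the practical.

```r
library(ggplot2)

# Map weight to x, fuel economy to y, and number of cylinders to colour
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders") +
  xlim(1, 6)

# Printing the object draws the plot
p
```

Because the plot is stored in `p`, further layers such as `geom_line()` or a theme can be added later with `+`.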

### Week 3

**HANDIN: Pass / Fail assignment**

*Find a dataset and create an Exploratory Data Analysis*

- Tip: The new Google dataset search.
- Format: stand-alone `RStudio` project folder with:
  - the dataset (`csv`, `xlsx`, `sav`, `dat`, `json`, or any other common format)
  - one `.Rmd` notebook file
  - a compiled `.pdf` or `.html`
- Requirements:
  - explain the dataset in 1 or 2 paragraphs
  - use `tidyverse`
  - clean, legible `R` code (preferably following the google style guide)
  - table(s) with relevant summary statistics
  - descriptive plots
  - explain what you did and why (max 3 paragraphs total)

**Supervised learning: Regression 1**

- `lm()`, the `formula` object, the `lm` object and its methods (`print()`, `summary()`, `coef()`, `plot()`)
- Regression lines in `ggplot` with uncertainty
- Linear regression with multiple variables, interaction effects
- Model assessment:
  - Train/test split
  - Mean square error calculation (`predict()`)
  - AIC, BIC
- Bias/variance tradeoff
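
The model assessment steps above can be sketched in base R as follows. This is an illustration under assumptions, not the practical's own code: it uses the built-in `mtcars` data, an arbitrary 70/30 split, and a hypothetical helper called `mse()`.

```r
set.seed(1)

# Random train/test split of the built-in mtcars data (assumed 70/30)
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Two candidate models: main effects only, and with an interaction effect
fit_main <- lm(mpg ~ wt + hp, data = train)
fit_int  <- lm(mpg ~ wt * hp, data = train)

# Hypothetical helper: mean squared error of a fitted model on a dataset
mse <- function(fit, data) {
  mean((data$mpg - predict(fit, newdata = data))^2)
}
mse_main <- mse(fit_main, test)
mse_int  <- mse(fit_int, test)

# Information criteria on the training data (lower is better)
AIC(fit_main, fit_int)
BIC(fit_main, fit_int)
```

Comparing the test-set errors with the training-set errors of the same models gives a first feel for the bias/variance tradeoff.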

### Week 4

**Supervised learning: Regression 2**

- Feature selection
- Regularization using the `glmnet` package
- Optimising lambda
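
A sketch of regularised regression with a cross-validated choice of lambda, assuming the `glmnet` package is installed. The data (`mtcars`), the lasso penalty (`alpha = 1`), and the number of folds are illustrative choices, not requirements of the practical.

```r
library(glmnet)

set.seed(1)

# glmnet expects a numeric predictor matrix and a response vector
x <- as.matrix(mtcars[, -1])   # all columns except mpg
y <- mtcars$mpg

# Lasso (alpha = 1) with 10-fold cross-validation over a grid of lambda values
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)

# Lambda with the lowest cross-validated error
best_lambda <- cv_fit$lambda.min

# Coefficients at that lambda; some may be shrunk to exactly zero
coef(cv_fit, s = "lambda.min")
```

Calling `plot(cv_fit)` shows the cross-validated error across the lambda grid, which helps to see how sensitive the result is to the penalty.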

### Week 5

**Supervised learning: Regression 3**

- Polynomial regression
- Nonlinear regression using the `splines` package
- Visualising nonlinear regression
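
The following sketch fits a polynomial and a spline regression and draws both curves. It assumes the built-in `mtcars` data, a quadratic polynomial, and a natural cubic spline with 3 degrees of freedom; these are illustrative choices. Base graphics are used here to keep the example dependency-free, although the practicals use `ggplot`.

```r
library(splines)

# Quadratic polynomial in horsepower
fit_poly <- lm(mpg ~ poly(hp, 2), data = mtcars)

# Natural cubic spline with 3 degrees of freedom
fit_spline <- lm(mpg ~ ns(hp, df = 3), data = mtcars)

# Predict on a grid of hp values to draw smooth fitted curves
grid <- data.frame(hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 100))
grid$poly   <- predict(fit_poly, newdata = grid)
grid$spline <- predict(fit_spline, newdata = grid)

plot(mtcars$hp, mtcars$mpg, xlab = "Horsepower", ylab = "Miles per gallon")
lines(grid$hp, grid$poly, lty = 1)
lines(grid$hp, grid$spline, lty = 2)
legend("topright", legend = c("polynomial", "spline"), lty = 1:2)
```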

### Week 6

**Supervised learning: Classification 1**

- (titanic data? default data?)
- KNN
- Logistic regression (see also 4.2)
- LDA
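
The three classifiers listed above can be tried out as follows. This is a sketch under assumptions: it uses the built-in `mtcars` data with transmission type as the outcome (not the Titanic or Default data mentioned above), `k = 3` neighbours, and in-sample predictions only. The `class` and `MASS` packages are recommended packages that normally ship with R.

```r
library(class)  # knn()
library(MASS)   # lda()

# Binary outcome: transmission type (0 = automatic, 1 = manual)
dat <- mtcars[, c("am", "wt", "hp")]
dat$am <- factor(dat$am)

# Logistic regression: predicted probability of a manual transmission
fit_glm <- glm(am ~ wt, data = dat, family = binomial)
prob_glm <- predict(fit_glm, type = "response")
pred_glm <- ifelse(prob_glm > 0.5, "1", "0")

# K-nearest neighbours (k = 3) on standardised predictors
x <- scale(dat[, c("wt", "hp")])
pred_knn <- knn(train = x, test = x, cl = dat$am, k = 3)

# Linear discriminant analysis
fit_lda <- lda(am ~ wt + hp, data = dat)
pred_lda <- predict(fit_lda)$class
```

Because the predictions are made on the same rows the models were trained on, the resulting accuracy is optimistic; a train/test split or cross-validation gives a fairer picture.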

### Week 7

**Supervised learning: assessing classification methods**

- Confusion matrix, errors, AUC, ROC curve
- Cross validation on classification problems
- Classification trees
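
A base R sketch of k-fold cross-validation for a classifier, ending in a confusion matrix. The data (`mtcars`), the logistic regression model, the 0.5 threshold, and the choice of 4 folds are all assumptions made for illustration. With folds this small, `glm()` may emit warnings about fitted probabilities close to 0 or 1.

```r
set.seed(1)

dat <- mtcars[, c("am", "wt")]
k <- 4                                            # number of folds (assumed)
fold <- sample(rep(1:k, length.out = nrow(dat)))  # random fold label per row

pred <- numeric(nrow(dat))
for (i in 1:k) {
  # Fit on all folds except fold i, then predict the held-out fold
  fit <- glm(am ~ wt, data = dat[fold != i, ], family = binomial)
  prob <- predict(fit, newdata = dat[fold == i, ], type = "response")
  pred[fold == i] <- as.numeric(prob > 0.5)
}

# Confusion matrix of out-of-fold predictions against the observed classes
conf_mat <- table(predicted = pred, observed = dat$am)
conf_mat

# Cross-validated accuracy
accuracy <- mean(pred == dat$am)
```

The off-diagonal cells of `conf_mat` are the two kinds of classification error; varying the threshold and recording the resulting error rates is the idea behind an ROC curve.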

### Week 8

**HANDIN: Pass / Fail assignment**

*Find a dataset and create and assess a prediction model*

- Tip: The new Google dataset search.
- Format: stand-alone `RStudio` project folder with:
  - the dataset (`csv`, `xlsx`, `sav`, `dat`, `json`, or any other common format)
  - one `.Rmd` notebook file
  - a compiled `.pdf` or `.html`
  - a `.Rproj` file
- Requirements:
  - explain the dataset in 1 or 2 paragraphs
  - use `tidyverse`
  - clean, legible `R` code (preferably following the google style guide)
  - explain which method you use
  - assess your predictions
  - draw conclusions about your predictions
  - use plots where useful (they are almost always useful)

**Unsupervised learning: PCA & Correspondence Analysis**

- PCA using `princomp`
- Visualising PCA
- SVD
- Correspondence Analysis & Biplots
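
To connect `princomp` with the SVD, the sketch below runs a PCA on the correlation matrix and recovers the same component standard deviations from the singular values of the standardised data. The built-in `USArrests` data is an illustrative choice, not necessarily the dataset used in the practical.

```r
# PCA on the correlation matrix of the built-in USArrests data
pca <- princomp(USArrests, cor = TRUE)
summary(pca)        # proportion of variance per component
loadings(pca)       # variable weights on each component
head(pca$scores)    # observations projected onto the components

# The same component standard deviations via the SVD of the standardised data
n <- nrow(USArrests)
sv <- svd(scale(USArrests))
sdev_svd <- sv$d / sqrt(n - 1)

# Biplot of observations and variables on the first two components
biplot(pca)
```

The division by `sqrt(n - 1)` is needed because `scale()` standardises with the sample standard deviation, so the squared singular values are `n - 1` times the eigenvalues of the correlation matrix.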

### Week 9

**Unsupervised learning: Clustering**

- K-means clustering with `kmeans()`
- Hierarchical clustering with `hclust()`
- Visualising clusters in `ggplot`
- Modularity clustering with igraph
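
A short sketch of the first two clustering methods in base R. The data (`USArrests`), the number of clusters (3), and complete linkage are assumptions made for illustration; modularity clustering with igraph is not covered here.

```r
set.seed(1)

# Standardise so that no single variable dominates the distances
x <- scale(USArrests)

# K-means with 3 clusters (assumed), using several random starts
km <- kmeans(x, centers = 3, nstart = 20)

# Hierarchical clustering on Euclidean distances, cut into 3 clusters
hc <- hclust(dist(x), method = "complete")
hc_clusters <- cutree(hc, k = 3)

# Cross-tabulate the two cluster assignments
table(kmeans = km$cluster, hclust = hc_clusters)
```

The cluster labels themselves are arbitrary, so agreement between the two methods shows up as one dominant cell per row of the table rather than as a filled diagonal.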