In this practical, we will perform singular value decomposition and
perform principical component analysis. You will not have to load any
packages in advance, as we will solely use base
R. The data for todays
practical can be downloaded here:
In this exercise, the data of Example 1 from the lecture slides are
used. These data have been stored in the file
1. Use the function
read.table() to import the data into R. Use the
as.matrix() to convert the data frame to a matrix. The two
features are not centered. To center the two features the function
scale() with the argument
scale = FALSE can be used. Give the
centered data matrix a name, for instance,
2. Calculate the sample size
N and the covariance matrix
executing the following R code, where the function
t() is used to
calculate the transpose of
%*% is used for matrix
N <- dim(C) S <- t(C) %*% C/N
3. Use the function
svd() to apply a singular value decomposition to
the centered data matrix.
4. Inspect the three pieces of output, that is,
the three matrices the same as on the slides?
5. Use a single matrix product to calculate the principal component scores.
6. Plot the scores on the second principal component (y-axis) against the scores on the first principal component (x-axis) and let the range of the x-axis run from -18 to 18 and the range of the y-axis from -16 to 16.
7. Use the function
eigen() to apply an eigendecomposition to the
sample covariance matrix.
Note that an alternative form of the eigendecomposition of the covariance matrix S is given by
where vj is the jth eigenvector of S. From this alternative form it is clear that the (ortho-normal) eigenvectors are only defined up to sign because vjvjT = − vj( − vjT). This may lead to differences in sign in the output.
8. Check whether the eigenvalues are equal to the variances of the two
principal components. Be aware that the R-base function
N − 1 in the denominator, to get an unbiased estimate of the
9. Finally, calculate the percentage of total variance explained by each principal component.
In this exercise, a PCA is used to determine the financial strength of insurance companies. Eight relevant features have been selected: (1) gross written premium, (2) net mathematical reserves, (3) gross claims paid, (4) net premium reserves, (5) net claim reserves, (6) net income, (7) share capital, and (8) gross written premium ceded in reinsurance.
To perform a principal component analysis, an eigendecomposition can be applied to the sample correlation matrix R instead of the sample covariance matrix S. Note that the sample correlation matrix is the sample covariance matrix of the standardized features. These two ways of doing a PCA will yield different results. If the features have the same scales (the same units), then the covariance matrix should be used. If the features have different scales, then it’s better in general to use the correlation matrix because otherwise the features with high absolute variances will dominate the results.
The means and standard deviations of the features can be found in the following table.
The sample correlation matrix is given below.
R to apply a PCA to the sample correlation matrix.
An alternative criterion for extracting a smaller number of principal components m than the number of original variables p in applying a PCA to the sample correlation matrix, is the eigenvalue-greater-than-one rule. This rule says that m (the number of extracted principal components) should be equal to the number of eigenvalues greater than one. Since each of the standardized variables has a variance of one, the total variance is p. If a principal component has an eigenvalue greater than one, than its variance is greater than the variance of each of the original standardized variables. Then, this principal component explains more of the total variance than each of the original standardized variables.
10. How many principal components should be extracted according to the eigenvalue-greater-than-one rule?
11. How much of the total variance does this number of extracted principal components explain?
12. Make a scree-plot. How many principal components should be extracted according to the scree-plot?
13. How much of the total variance does this number of extracted principal components explain?
In this assignment, you will perform a PCA to a simple and easy to
understand dataset. You will use the
mtcars dataset, which is built
into R. This dataset consists of data on 32 models of car, taken from an
American motoring magazine (1974 Motor Trend magazine). For each car,
you have 11 features, expressed in varying units (US units). They are as
mpg: fuel consumption (miles per (US) gallon); more powerful and heavier cars tend to consume more fuel.
cyl: number of cylinders; more powerful cars often have more cylinders.
disp: displacement (cu.in.); the combined volume of the engine’s cylinders.
hp: gross horsepower; this is a measure of the power generated by the car.
drat: rear axle ratio; this describes how a turn of the drive shaft corresponds to a turn of the wheels. Higher values will decrease fuel efficiency.
wt: weight (1000 lbs).
qsec: 1/4 mile time, the cars speed and acceleration.
vs: engine block; this denotes whether the vehicle’s engine is shaped like a ‘V’, or is a more common straight shape.
am: transmission; this denotes whether the car’s transmission is automatic (0) or manual (1).
gear: number of forward gears; sports cars tend to have more gears.
carb: number of carburetors; associated with more powerful engines.
Note that the units used vary and occupy different scales.
First, the principal components will be computed. Because PCA works best
with numerical data, you’ll exclude the two categorical variables (vs
and am; columns 8 and 9). You are left with a matrix of 9 columns and 32
rows, which you pass to the
prcomp() function, assigning your output
mtcars.pca. You will also set two arguments,
TRUE. This is done to apply a principal component analysis to
the standardized features. So, execute
mtcars.pca <- prcomp(mtcars[, c(1:7, 10, 11)], center = TRUE, scale = TRUE)
14. Have a peek at the PCA object with
You obtain 9 principal components, which you call PC1-9. Each of these explains a percentage of the total variance in the dataset.
15. What is the percentage of total variance explained by PC1?
16. What is the percentage of total variance explained by PC1, PC2, and PC3 together?
The PCA object
mtcars.pca contains the following information:
16. Determine the eigenvalues. How many principal components should be extracted according to the eigenvalue-greater-than-one rule?
17. What is the value of the total variance? Why?
18. How much of the total variance is explained by the number of extracted principal components according to the eigenvalue-greater-than-one rule?
Next, a couple of plots will be produced to visualize the PCA solution. You will make a biplot, which includes both the position of each observation (car model) in terms of PC1 and PC2 and also will show you how the initial features map onto this. A biplot is a type of plot that will allow you to visualize how the observations relate to one another in the PCA (which observations are similar and which are different) and will simultaneously reveal how each feature contributes to each principal component.
19. Use the function
biplot() with the argument
choices = c(1, 2)
to ask for a biplot for the first two principal components.
The axes of the biplot are seen as arrows originating from the center point. Here, you see that the variables hp, cyl, and disp all contribute to PC1, with higher values in those variables moving the observations to the right on this plot. This lets you see how the car models relate to the axes. You can also see which cars are similar to one another. For example, the Maserati Bora, Ferrari Dino and Ford Pantera L all cluster together at the top. This makes sense, as all of these are sports cars.
20. Make a biplot for the first and third principal components. Especially which brand of car has negative values on the first principal component and positive values on the third principal component?
21. Use the function
screeplot() with the argument
type = 'lines'
to produce a scree-plot. How many principal components should be
extracted according to this plot? Why? Is this number in agreement with
the number of principal components extracted according to the