Tech Notes

My notes on Statistics, Big Data, Cloud Computing, Cyber Security

Principal Component Analysis

PCA is a way of identifying patterns in data and expressing the data so as to highlight their similarities and differences. Since patterns can be hard to find in data of high dimension, where the luxury of graphical representation is not available, PCA is a powerful tool for analysing data. The other main advantage of PCA is that once you have found these patterns, you can compress the data, i.e. reduce the number of dimensions, without much loss of information. This technique is used in image compression.

In other words, PCA is a method of dimensionality reduction that sacrifices little of the data's accuracy. It summarizes data with many variables into a smaller set of derived variables (components), such that the first component has the maximum variance, followed by the second, the third, and so on, and the covariance of any component with any other component is 0.

It redistributes the total variance so that the first K components explain as much of it as possible, where total variance = variance of variable 1 + variance of variable 2 + …
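As a quick illustration of these properties, here is a minimal NumPy sketch (the toy data and variable names are mine; the notes below walk through the same steps in R):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated variables, 500 observations (toy data).
x = rng.normal(size=500)
data = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=500)])

centered = data - data.mean(axis=0)         # center each column
cov = np.cov(centered, rowvar=False)        # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigen-decomposition
order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = centered @ eigvecs                 # data re-expressed in PCs

# Component variances are the eigenvalues and sum to the total variance...
print(np.allclose(scores.var(axis=0, ddof=1), eigvals))               # True
print(np.isclose(eigvals.sum(), centered.var(axis=0, ddof=1).sum()))  # True
# ...and the components are uncorrelated.
print(abs(np.cov(scores, rowvar=False)[0, 1]) < 1e-8)                 # True
```

The decomposition simply rotates the centered data into the eigenvector basis, so the total variance is preserved and merely redistributed across the components.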

For a step-by-step procedure, follow this link:

http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

The same steps, reproduced in R, are below.



library(UsingR); data(galton)
plot(jitter(galton$parent,factor=2),jitter(galton$child,factor=2),pch=19,col="blue")
center = function(x) {x - mean(x)} # subtract the mean (the data is centered, not fully standardized)
myscaledvar <- apply(galton, 2, center) # get centered data
plot(myscaledvar,pch=19,col="blue", xlim=c(-20,20))
# get covariance matrix
mycov = cov(myscaledvar)
# get eigenvectors and eigenvalues
my.eigen = eigen(mycov)
rownames(my.eigen$vectors) = c("child","parent")
colnames(my.eigen$vectors) = c("PC1","PC2")

# sum of the eigenvalues = the total variance
sum(my.eigen$values)
var(myscaledvar[,1]) + var(myscaledvar[,2])
# The eigenvectors are the principal components
pc1.slope = my.eigen$vectors[1,1]/my.eigen$vectors[2,1]
pc2.slope = my.eigen$vectors[1,2]/my.eigen$vectors[2,2]
abline(0,pc1.slope,col="red") # the red line explains most of the variation
abline(0,pc2.slope,col="green")

# See how much variation each eigenvector accounts for
pc1.var = 100*round(my.eigen$values[1]/sum(my.eigen$values),digits=2)
pc2.var = 100*round(my.eigen$values[2]/sum(my.eigen$values),digits=2)
pc1.var
pc2.var
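The proportion-of-variance computation above can be sketched in NumPy as well (again on toy two-variable data of my own, not the galton set):

```python
import numpy as np

rng = np.random.default_rng(2)
# Two strongly correlated toy variables, 400 observations.
x = rng.normal(size=400)
data = np.column_stack([x, 0.9 * x + 0.2 * rng.normal(size=400)])

# Eigenvalues of the covariance matrix, in descending order.
eigvals = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]

# Percentage of total variance explained by each component.
pct = 100 * eigvals / eigvals.sum()
print(np.isclose(pct.sum(), 100))  # True: the percentages sum to 100
print(pct[0] > pct[1])             # True: PC1 explains the most variance
```

Because the eigenvalues are sorted in decreasing order, the first component always accounts for the largest share of the total variance.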

# Multiply the scaled data by the principal components
loadings = my.eigen$vectors
scores = myscaledvar %*% loadings # this multiplication re-expresses the data in terms of the PCs; this is the transformed data

sd = sqrt(my.eigen$values)
rownames(loadings) = colnames(myscaledvar)

# plotting the scores
plot(scores, ylim=c(-10,10), main="Data in terms of Eigenvectors / PCs", xlab="PC1", ylab="PC2")
abline(h=0, col="red")
abline(v=0, col="green")

# BiPlot - primary visuals of PCAs
scores.min = min(scores[,1:2])
scores.max = max(scores[,1:2])

# draw the axes
plot(scores[,1]/sd[1], scores[,2]/sd[2], main="BiPlot", xlab="PC1", ylab="PC2", type="n")
rownames(scores) = seq(1:nrow(scores))
abline(h=0, col="red")
abline(v=0, col="green")

# This is to make the size of the lines more apparent
factor = 5
# First plot the variables as vectors
arrows(0,0, loadings[,1]*sd[1]/factor, loadings[,2]*sd[2]/factor, length=0.1, lwd=2, angle=20, col="red")
# the angle between the arrows turns out quite small, hence the correlation is high

text(loadings[,1]*sd[1]/factor*1.2,loadings[,2]*sd[2]/factor*1.2,rownames(loadings), col="red", cex=1.2)
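The claim that a small angle between the arrows means high correlation can be checked directly: with each variable's arrow given by its loadings scaled by the component standard deviations (as in the `arrows` call above), the cosine of the angle between two arrows equals the correlation between the corresponding variables. A NumPy check on toy data of my own:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two correlated toy variables, 500 observations.
x = rng.normal(size=500)
data = np.column_stack([x, 0.6 * x + 0.5 * rng.normal(size=500)])

cov = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Arrow coordinates: loadings scaled by component standard deviations.
# Row i is the arrow for variable i, so arrows @ arrows.T = cov.
arrows = eigvecs * np.sqrt(eigvals)

cos_angle = arrows[0] @ arrows[1] / (
    np.linalg.norm(arrows[0]) * np.linalg.norm(arrows[1]))
corr = np.corrcoef(data, rowvar=False)[0, 1]
print(np.isclose(cos_angle, corr))  # True
```

This works because the scaled-loadings matrix is a square root of the covariance matrix: the dot product of two rows is the covariance of the two variables, and each row's length is that variable's standard deviation, so the cosine is exactly the correlation.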

# short way to do all the above in R
prcomp(galton) # this gives us the same values as above.

Disclaimer: These are my study notes, kept online instead of on paper so that others can benefit. In the process I have used some pictures/content from other original authors. All sources/original content publishers are listed below, and they deserve credit for their work. No copyright violation is intended.

References for these notes:

http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

http://www.youtube.com/watch?v=BfTMmoDFXyE

http://gastonsanchez.wordpress.com/2012/06/17/principal-components-analysis-in-r-part-1/

http://www.youtube.com/watch?v=5zk93CpKYhg
