Tech Notes

My notes on Statistics, Big Data, Cloud Computing, Cyber Security

Singular Value Decomposition (Also explains PCA)

The below is a reproduction of an answer in the Coursera discussion forum to the question that SVD was too complicated to understand and the material available on the web, directly goes into math instead of explaining what SVD and PCA really does.

Ive reproduced it here because it is too good and no part needs to be edited. Also once the course is archived this will not be available anymore.

Full credit goes to Pete Kazmier – https://class.coursera.org/dataanalysis-002/forum/profile?user_id=5233139

============================================================================

After reading more and going back to the lectures, I think I finally understand the practical aspect of SVD/PCA when it comes to a data analysis. Most of the material I found online was focused on “how” these tools work and the math behind them, which is of little interest to me. I’m much more interested in the use of the tools. In short, I drive a car to work everyday, but I don’t care how its engine is built, only that it gets me from point A to point B. The following is my attempt to help others move past these lectures with some understanding of the material and how it relates to data analysis.

Let’s use the diamond data set as an example:

> data(diamonds)
> str(diamonds)


'data.frame':    53940 obs. of  10 variables:
 $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...  

You’ll notice that most of the data frame is numerical, with the exception of cut, color, and clarity which are ordered factors. Why do I point this out? Because if you want to use SVD or PCA, you need numerical data. We can perform the SVD/PCA without these columns, but why not convert these factors to numbers so we can use the entire data frame.

> diamonds$cut <- as.numeric(diamonds$cut)
> diamonds$clarity <- as.numeric(diamonds$clarity)
> diamonds$color <- as.numeric(factor(diamonds$color, levels=rev(levels(diamonds$color))))
> str(diamonds)


'data.frame':    53940 obs. of  10 variables:
 $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : num  5 4 2 4 2 3 3 3 1 3 ...
 $ color  : num  6 6 6 2 1 1 2 3 6 3 ...
 $ clarity: num  2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

As you can see, I used the as.numeric() function to convert the cut and clarity columns. These two factors were already ordered such that the conversion results in lower values corresponding to lesser quality. Unfortunately, the colors column was in reverse order, where higher quality would be represented by lower values, so I converted colors and reversed the order of the levels in the process to maintain consistency.

Let’s move on to investigating SVD and PCA now. At a high level, these two tools take our data and break it apart into components of varying resolution. For example, if you were to think of each component as a transparency for an overhead projector, each time you add another transparency (component) on top of the previous one, you are adding more detail to the overall image projected (representing our data). At first the image may be blurry (an estimation of our original data), but if you stack all of the transparencies (components) on top of each other, you will eventually see the full high resolution image (the original data)! For those that don’t remember what an overhead projector is, I suppose you could use the analogy of looking at Google maps for a particular city. When zoomed out, you see the general shape of the city, but each time you zoom in further, more and more fine detail is added.

How do we break our data down into these transparencies? We’ll use the svd() and prcomp() functions to compute the SVD and PCA. I’ll show both calculations below, but I will let you in on a secret. After everything I’ve learned, I see no reason to use the SVD calculation directly as the PCA provides the same information with nicer column labels.

> s <- svd(scale(diamonds))
> p <- prcomp(diamonds, scale.=TRUE)

Note: before computing the SVD and PCA, I scaled the columns in our data frame to standardize them because it doesn’t make sense to compare two columns such as carat, where values are very small, and price where values are very large. With SVD, we explicitly scale() the date frame before calling svd(). With PCA, we just pass an additional argument to specify scaling. Let’s look at the results:

> str(s)


List of 3
 $ d: num [1:10] 520 278 258 231 204 ...
 $ u: num [1:53940, 1:10] -0.00587 -0.00572 -0.00519 -0.00452 -0.00327 ...
 $ v: num [1:10, 1:10] 0.44059 -0.08471 -0.13698 -0.18083 0.00771 ...


> str(p)


List of 5
 $ sdev    : num [1:10] 2.238 1.199 1.111 0.993 0.878 ...
 $ rotation: num [1:10, 1:10] 0.44059 -0.08471 -0.13698 -0.18083 0.00771 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:10] "carat" "cut" "color" "clarity" ...
  .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
 $ center  : Named num [1:10] 0.798 3.904 4.406 4.051 61.749 ...
  ..- attr(*, "names")= chr [1:10] "carat" "cut" "color" "clarity" ...
 $ scale   : Named num [1:10] 0.474 1.117 1.701 1.647 1.433 ...
  ..- attr(*, "names")= chr [1:10] "carat" "cut" "color" "clarity" ...
 $ x       : num [1:53940, 1:10] -3.05 -2.97 -2.7 -2.35 -1.7 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
 - attr(*, "class")= chr "prcomp"

The SVD object has three vectors: d, u, and v. The PCA object has two of interest: sdev and rotation. Before we get into how to interpret this data, you should know that some of the variables are closely related. For example, d and sdev are essentially the same thing. Although the values look different, if we graph them as a percent of variation (square / sum of squares), you’ll notice they are identical:


> par(mfrow=c(1,2))
> plot(s$d^2 / sum(s$d^2))
> plot(p$sdev^2 / sum(p$sdev^2))

I’d show the graph, but I can’t seem to get this editor to insert an image.

Great, they are exactly the same, so what? What do they tell me? In short, they tell you how much detail is in each transparency (component). For each data set, the amount of detail on the various layers of transparencies will vary. In some cases, it may take only one layer to provide 99% of the detail. In other cases, it may take 10 layers of transparencies to provide 99% of the detail. SVD’s d vector or PCA’s sdev vector tells you how much detail is included in each layer. The x-axis represents the layers of transparency. In our example, there are a total of 10 layers. The y-axis represents the percentage that each layer contributes to the overall image. In the graph above, layer 1 contributes to 50% of the overall image (data set). layer 2 contributes an additional 14% of detail, layer 3 contributes an additional 12%, and so on. Each additional layer adds less and less detail back to the picture. Eventually, the contribution becomes very small. Layers 7, 8, 9, and 10 are only contributing an additional 1.3%, 0.4%, 0.3%, and 0.1% of detail.

We can see the same by looking at the values directly:

# Percentage that each layer contributes to overall image
> round((p$sdev^2 / sum(p$sdev^2)), 3)

[1] 0.501 0.144 0.123 0.099 0.077 0.036 0.013 0.004 0.003 0.001> 

# What percentage do the first 5 layers contribute?
> sum((p$sdev^2 / sum(p$sdev^2))[1:5])

[1] 0.9434833  

This means we can get reasonably close to the original image (data) by applying only a subset of the transparencies (components). For our diamond data set, we might be satisfied with using only the first five layers as they provide 94% of the original resolution. This is what is meant by dimension reduction. The resolution is a choice of how many transparencies (components) you want to use.

Okay, now that we have explained the d vector in the SVD computation and the equivalent sdev vector of the PCA computation, let’s move on to the next variable of interest: SVD’s v vector and PCA’s rotation vector. Again, they are one and the same. What do they look like again?

> str(s$v)

num [1:10, 1:10] 0.44059 -0.08471 -0.13698 -0.18083 0.00771 ...

> str(p$rotation)

num [1:10, 1:10] 0.44059 -0.08471 -0.13698 -0.18083 0.00771 ...
* attr(*, "dimnames")=List of 2
..$ : chr [1:10] "carat" "cut" "color" "clarity" ...
..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...

Each column of v or rotation corresponds to a specific layer of transparency (component) that we were discussing a moment ago. When we said that layer 1 contributes to 50% of the total image (data), we were referring to column 1. Let’s plot that layer to get a better idea of what it represents:

> plot(s$v[,1], pch=19)
> plot(p$rotation[,1], pch=19)

Again, these look the same because they represent the same thing. What are they telling us about our first transparency layer? The x-axis represents each variable of our original data frame: carat, cut, color, clarity, depth, etc … The y-axis tells us how much that particular variable contributes to the resolution of this particular layer. If the value is close to zero, that variable does not contribute much to this particular layer. On the other hand, the farther away from zero (positive or negative direction), the more it contributes to the detail on this particular transparency layer. Since I can’t seem to include images, let’s look at the actual values:

> round(s$v[,1], 2)

[1] 0.44 -0.08 -0.14 -0.18 0.01 0.11 0.40 0.44 0.44 0.43

> round(p$rotation[,1], 2)

carat cut color clarity depth table price x y z 
0.44 -0.08 -0.14 -0.18 0.01 0.11 0.40 0.44 0.44 0.43<

Looking at the values above (same as the y-axis in the graph), we can see that carat, price, x, y, and z variables influence the detail in the first transparency (component) more so than any other variables. Cut, color, and clarity influence the detail of our image (data) in the opposite direction, but to a lesser degree. Remember, this first transparency represents 50% of the detail of our original image (data).
So, if we were satisfied with this amount of resolution, we would see that the larger diamonds cost a lot of money compared to smaller diamonds. Interestingly, the cut, color, clarity,at this resolution, has a negative correlation to price and size.

The reason I prefer PCA over SVD is that it’s easy to quickly view multiple layers (components) because the PCA object is neatly labeled for us already. For example, earlier we stated that the first five layers (components) provided 94% of the detail to the overall image (original data). Well, let’s look at all five of those components:

> round(p$rotation[,1:5], 2)

          PC1   PC2   PC3   PC4   PC5
carat    0.44 -0.06  0.01 -0.01  0.05
cut     -0.08 -0.63  0.37 -0.22 -0.25
color   -0.14  0.15  0.04 -0.82  0.52
clarity -0.18 -0.28  0.22  0.49  0.74
depth    0.01 -0.12 -0.84  0.06  0.16
table    0.11  0.67  0.30  0.18  0.06
price    0.40 -0.14  0.09 -0.02  0.30
x        0.44 -0.05  0.04 -0.04  0.02
y        0.44 -0.05  0.05 -0.04  0.02
z        0.43 -0.06 -0.06 -0.03  0.04

Neat, huh? We already looked at layer 1 (PC1 in the table above), so let’s look at layer 2 (PC2). It looks like the greatest contributors to detail on this layer are the cut and table variables. This time the physical size of the diamond (x, y, z, and carat) have little influence. Depth plays a big role in layer 3 and is in contrast to cut and table. Transparency layers 4 and 5 add even more detail to the overall image.

There you have it. Using the information from the d/sdev vectors in conjunction with the v/rotation vectors can provide insight into which variables contribute most to your data set. The d/sdev will tell you how much detail each component (transparency) contributes to the overall data set (projected image), and the v/rotation vectors will tell you how the individual variables of your original data set influence a specific component (transparency).

I hope this helps others assuming my understanding is correct! Again, I was looking at this from a practical perspective of how to use the tools and what insights one might make as a result. I couldn’t care less about matrix multiplication or how they’re actually computed (no offense math lovers!). I just want to drive my car …

Disclaimer : These are my study notes – online – instead of on paper so that others can benefit. In the process I’ve have used some pictures / content from other original authors. All sources / original content publishers are listed below and they deserve credit for their work. No copyright violation intended.

Referencesfor these notes :

Material from the discussion forum for the MOOC “Data Analysis” at Coursera.org

http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html

http://www.stat.cmu.edu/~cshalizi/490/pca/pca-handout.pdf

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: