Principal Component Analysis | Chapter 2: The Actual Workings of PCA

The projection of the data points onto a principal component. Courtesy of Lior Pachter

In this three-part series of posts, we will discuss a simple machine learning algorithm called Principal Component Analysis (PCA), which can be explained using only linear algebra. The algorithm is considered simple and easy to understand because it requires basic knowledge of just one subfield of mathematics. The topic is divided into three chapters that are designed to be read independently, so you can read the entire series or only the individual chapters that match your needs and background. Please note that at least a high-school level of mathematical knowledge is assumed for all three posts. The three chapters are as follows:

  1. Chapter 1: Understanding Principal Component Analysis. This post is written for readers who have little to no knowledge of linear algebra.
  2. Chapter 2: The Actual Workings of PCA. This post is written for readers with knowledge of linear algebra. It rigorously explains two different mathematical methods that power PCA: the Power Iteration method and the Singular Value Decomposition (SVD).
  3. Chapter 3: Eigenfaces using PCA. This post is written for readers with some programming experience. The chapter discusses the eigenfaces technique, which uses PCA, and walks through a code example written in Python.

But of course, to get the full picture and understand every nook and cranny of the PCA algorithm, it is advised that the reader start from Chapter 1 and work their way up to Chapter 3.

. . .

2 The Actual Workings of PCA

The PCA method can be defined as follows: for a given data set and a parameter k, compute the k-dimensional subspace (through the origin) that minimizes the average squared distance between the points and the subspace, or equivalently, that maximizes the variance of the projections of the data points onto the subspace. The orthonormal basis vectors of this subspace are called the top k principal components.

In this section we will look at how to actually compute those top k principal components using two different methods:

  • The Power Iteration method.
  • The Singular Value Decomposition (SVD) method.

But first, let’s look at what the top k principal components actually are.

2.1 Characterizing the Principal Components

2.1.1 The Setup

The PCA method takes in n d-dimensional data points \boldsymbol{x_1}, \dotsb , \boldsymbol{x_n} \in \mathbb{R}^d and a parameter k \in \{1, 2, \dotsb , d\} – which is the number of the top principal components desired. We assume that the data is centered around the origin, meaning that \sum_{i=1}^n \boldsymbol{x_i} = \boldsymbol{0} where \boldsymbol{0} is the all-zero vector. This can be enforced by subtracting the sample mean \boldsymbol{\bar{x}}=\frac{1}{n}\sum_{i=1}^n\boldsymbol{x_i} from every data point while preprocessing (preparing) the data. The output of the PCA method is defined as k orthonormal vectors \boldsymbol{v_1}, \dotsb , \boldsymbol{v_k} – the “top k principal components” – that maximize the objective function

(1)   \begin{equation*}\frac{1}{n}\sum_{i=1}^n \sum_{j=1}^k\langle \boldsymbol{x_i},\boldsymbol{v_j} \rangle ^2. \end{equation*}

Recall that, since \boldsymbol{v_j} is a unit vector, the inner product \langle \boldsymbol{x_i},\boldsymbol{v_j} \rangle is the (signed) length of the projection of \boldsymbol{x_i} onto the line spanned by \boldsymbol{v_j}.
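
As a quick illustration, here is a minimal NumPy sketch (with synthetic data and variable names of my own choosing) of the centering step and of evaluating the objective (1) for a candidate set of orthonormal vectors:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))        # n = 100 data points in R^d with d = 3

    # Center the data so that each coordinate has zero mean.
    X = X - X.mean(axis=0)

    def pca_objective(X, V):
        """Average squared projection length of the rows of X onto the
        orthonormal columns of V -- the quantity (1) that PCA maximizes."""
        return np.sum((X @ V) ** 2) / X.shape[0]

    # Evaluate the objective for an arbitrary orthonormal candidate (k = 1 here).
    v = np.array([[1.0], [0.0], [0.0]])
    print(pca_objective(X, v))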

But how would one actually solve this optimization problem and compute the \boldsymbol{v_j}’s? Even with k = 1, there are infinitely many unit vectors to try. We will build up a geometric intuition so that the solution seems straightforward in hindsight.

2.1.2 Rewriting the Problem

First, let’s consider the k=1 case; we will later see that the case of general k reduces to it. With k=1, the optimization problem becomes

(2)   \begin{equation*}   \argmax_{\boldsymbol{v}:\|\boldsymbol{v}\|=1} \frac{1}{n}\sum_{i=1}^n \langle \boldsymbol{x_i},\boldsymbol{v} \rangle ^2.  \end{equation*}

To rewrite the variance-maximization in (2) using linear algebra, we first take the data points \boldsymbol{x_1}, \dotsb , \boldsymbol{x_n} \in \mathbb{R}^d and write them as the rows of an n \times d matrix \boldsymbol{X} as follows:

    \[ \boldsymbol{X}=\left[ {\begin{array}{ccc} \boldsymbol{-} & \boldsymbol{x_1} & \boldsymbol{-} \\ \boldsymbol{-} & \boldsymbol{x_2} & \boldsymbol{-} \\ & . & \\ & . & \\ & . & \\ \boldsymbol{-} & \boldsymbol{x_n} & \boldsymbol{-} \\ \end{array} } \right]. \]

Thus, for a unit vector \boldsymbol{v} \in \mathbb{R}^d we have

    \[ \boldsymbol{Xv}=\left[ {\begin{array}{c} \langle \boldsymbol{x_1},\boldsymbol{v} \rangle \\ \langle \boldsymbol{x_2},\boldsymbol{v} \rangle \\ . \\ . \\ . \\ \langle \boldsymbol{x_n},\boldsymbol{v} \rangle \\ \end{array} } \right], \]

so \boldsymbol{Xv} is just a column vector populated with all the projection lengths of the \boldsymbol{x_i}’s onto the line spanned by \boldsymbol{v}. Recall from (2) that we care about the squares of these inner products, which can be obtained by taking the inner product of \boldsymbol{Xv} with itself:

    \[ \boldsymbol{v}^T\boldsymbol{X}^T\boldsymbol{Xv}=(\boldsymbol{Xv})^T\boldsymbol{Xv}=\sum_{i=1}^n\langle \boldsymbol{x_i},\boldsymbol{v} \rangle ^2. \]

And so, our variance-maximization problem can be rephrased as

(3)   \begin{equation*}   \argmax_{\boldsymbol{v}:\|\boldsymbol{v}\|=1} \boldsymbol{v}^T\boldsymbol{Av}, \end{equation*}

where \boldsymbol{A} is a d \times d matrix of the form \boldsymbol{X}^T\boldsymbol{X}. This problem is called “maximizing a quadratic form”. Do note that we are ignoring the \frac{1}{n} in (2) because the optimal solution \boldsymbol{v} is the same with or without it.

The matrix \boldsymbol{X}^T\boldsymbol{X} has a natural interpretation. The (i, j) entry of this matrix is the inner product of the ith row of \boldsymbol{X}^T and the jth column of \boldsymbol{X} – i.e. of the ith and jth columns of \boldsymbol{X}. So \boldsymbol{X}^T\boldsymbol{X} just collects the inner products of columns of \boldsymbol{X}, and is a symmetric matrix.

For example, suppose the data points \boldsymbol{x_i} represent documents, with dimensions (i.e. columns of \boldsymbol{X}) corresponding to words. Then the inner product of two columns of \boldsymbol{X} measures how frequently the corresponding pair of words co-occurs across the documents. So after centering such an \boldsymbol{X}, frequently co-occurring pairs of words correspond to positive entries of \boldsymbol{X}^T\boldsymbol{X}, and pairs of words that almost never appear together correspond to negative entries. The matrix \boldsymbol{X}^T\boldsymbol{X} is called the covariance or correlation matrix of the \boldsymbol{x_i}’s, depending on whether or not each of the coordinates was normalized so as to have unit variance in a preprocessing step.
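
The following small NumPy sketch (synthetic data and names of my own choosing) checks both facts numerically: that \boldsymbol{v}^T(\boldsymbol{X}^T\boldsymbol{X})\boldsymbol{v} equals the sum of squared projections, and that \boldsymbol{X}^T\boldsymbol{X} agrees with the sample covariance matrix up to a constant factor:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    X = X - X.mean(axis=0)               # center the data, as assumed throughout

    A = X.T @ X                          # d x d, symmetric

    # v^T A v equals the sum of the squared projections of the rows of X onto v.
    v = np.array([1.0, 0.0, 0.0])
    print(np.isclose(v @ A @ v, np.sum((X @ v) ** 2)))                  # True

    # Up to the 1/(n-1) factor used by np.cov, A is the sample covariance matrix.
    print(np.allclose(A / (X.shape[0] - 1), np.cov(X, rowvar=False)))   # True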

2.1.3 Solving for the Diagonal Matrices Case

To gain some understanding for the optimization problem in (3) that PCA solves, let’s begin with a very special case: where A is a diagonal matrix

(4)   \begin{equation*} \left[ {\begin{array}{ccccc} \lambda_1 & & & & 0 \\ & \lambda_2 & & & \\ & & \ddots & & \\ 0 & & & & \lambda_d \end{array} } \right] \end{equation*}

with sorted nonnegative entries \lambda_1 \geq\lambda_2\geq\dotsb \geq\lambda_d\geq 0 on the diagonal.

There is a simple geometric way to think about diagonal matrices. A d \times d matrix maps points in \mathbb{R}^d back to points in \mathbb{R}^d – the matrix “moves \mathbb{R}^d around”, in effect. For example, the matrix

    \[ \left[ {\begin{array}{cc} 2 & 0 \\ 0 & 1 \end{array} } \right] \]

moves each point (x, y) of the plane to the point (2x, y) with double the x-coordinate and the same y-coordinate. For example, the points of the circle \{(x, y) : x^2+y^2 = 1\} are mapped to the points \{(x, y) : \frac{x^2}{2^2}+y^2=1\} of the ellipse shown in Figure 5.

Figure 5: The point (x, y) on the unit circle is mapped to (2x, y).

More generally, a diagonal matrix of the form (4) can be thought of as “stretching” \mathbb{R}^d, with the ith axis getting stretched by the factor \lambda_i, and the unit circle being mapped to the corresponding “ellipsoid” (i.e. the analog of an ellipse in more than 2 dimensions).

A natural guess for the direction \boldsymbol{v} that maximizes \boldsymbol{v}^T\boldsymbol{Av} with \boldsymbol{A} diagonal is the “direction of maximum stretch”, namely \boldsymbol{v}=\boldsymbol{e_1}, where \boldsymbol{e_1}=(1, 0, \dotsb , 0) denotes the first standard basis vector (keep in mind that \lambda_1 \geq\lambda_2\geq\dotsb\geq\lambda_d\geq 0). To verify the guess, let \boldsymbol{v} be an arbitrary unit vector, and write

(5)   \begin{equation*} \boldsymbol{v}^T(\boldsymbol{Av})= \left[ {\begin{array}{cccc} v_1 & v_2 & \dotsb & v_d \end{array} } \right] . \left[ {\begin{array}{c} \lambda_1v_1 \\ \lambda_2v_2 \\ \vdots \\ \lambda_dv_d \end{array} } \right] = \sum_{i=1}^d v_i^2\lambda_i . \end{equation*}

Since \boldsymbol{v} is a unit vector, the v_i^2’s sum to 1. Thus \boldsymbol{v}^T\boldsymbol{Av} is always a weighted average of the \lambda_i’s, with the averaging weights given by the v_i^2’s. Since \lambda_1 is the biggest \lambda_i, the way to make this average as large as possible is to set v_1=1 and v_i=0 for i \geq 2. That is, \boldsymbol{v}=\boldsymbol{e_1} maximizes \boldsymbol{v}^T\boldsymbol{Av}, as per our guess.
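
A tiny numerical sanity check of this claim (the diagonal entries below are arbitrary choices of mine): no random unit vector does better than \boldsymbol{e_1}, whose value of the quadratic form is \lambda_1.

    import numpy as np

    # Diagonal A with sorted nonnegative entries, as in (4).
    A = np.diag([3.0, 2.0, 0.5])

    rng = np.random.default_rng(0)
    best = -np.inf
    for _ in range(10_000):
        v = rng.normal(size=3)
        v /= np.linalg.norm(v)           # random unit vector
        best = max(best, v @ A @ v)

    e1 = np.array([1.0, 0.0, 0.0])
    print(e1 @ A @ e1)                   # 3.0, i.e. lambda_1
    print(best <= e1 @ A @ e1)           # True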

2.1.4 Diagonals in Disguise

Let’s generalize our solution in Section 2.1.3 by considering matrices \boldsymbol{A} that, while not diagonal, are really just “diagonals in disguise.” Geometrically, what we mean is that \boldsymbol{A} still does nothing other than stretch out different orthogonal axes, possibly with these axes being a “rotated version” of the original ones. See Figure 6 for a rotated version of the previous example, which corresponds to the matrix

(6)   \begin{equation*} \left[ {\begin{array}{cc} \frac{3}{2} & \frac{1}{2} \\ \frac{1}{2} & \frac{3}{2}  \end{array} } \right] = \left[ {\begin{array}{cc} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}  \end{array} } \right] . \left[ {\begin{array}{cc} 2 & 0 \\ 0 & 1  \end{array} } \right] . \left[ {\begin{array}{cc} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}  \end{array} } \right] \end{equation*}

Figure 6: The same scaling as Figure 5, but now rotated 45 degrees.

So what’s a “rotated diagonal” in higher dimensions? The appropriate generalization of a rotation is an orthogonal matrix. Note that orthogonal matrices also capture operations like reflections and permutations of the coordinates. Recall that an orthogonal matrix \boldsymbol{Q} is a square matrix where the columns are a set of orthonormal vectors – that is, each column has unit length, and the inner product of two different columns is 0. A key property of orthogonal matrices is that they preserve length – that is, \|\boldsymbol{Qv}\| = \|\boldsymbol{v}\| for every vector \boldsymbol{v}. We review briefly why this is: since the columns of \boldsymbol{Q} are orthonormal vectors, we have

(7)   \begin{equation*} \boldsymbol{Q}^T\boldsymbol{Q}=\boldsymbol{I}, \end{equation*}

since the (i, j) entry of \boldsymbol{Q}^T\boldsymbol{Q} is just the inner product of the ith and jth columns of \boldsymbol{Q} (it also follows that \boldsymbol{QQ}^T=\boldsymbol{I}, which shows that \boldsymbol{Q}^T is also an orthogonal matrix, or equivalently that the rows of an orthogonal matrix are orthonormal vectors). This means that the inverse of an orthogonal matrix is simply its transpose; equation (6) is a good example of this. Then, we have

    \[ \|\boldsymbol{Qv}\|^2=(\boldsymbol{Qv})^T\boldsymbol{Qv}=\boldsymbol{v}^T\boldsymbol{Q}^T\boldsymbol{Qv}=\boldsymbol{v}^T\boldsymbol{v}=\|\boldsymbol{v}\|^2 ,\]

showing that \boldsymbol{Qv} and \boldsymbol{v} have the same norm.

Now consider a matrix \boldsymbol{A} that can be written as \boldsymbol{A}=\boldsymbol{QDQ}^T for an orthogonal matrix \boldsymbol{Q} and diagonal matrix \boldsymbol{D} as in (4) – this is what we mean by a “diagonal in disguise”. Such a matrix \boldsymbol{A} has a “direction of maximum stretch” – the (rotated) axis that gets stretched the most (i.e. by \lambda_1). Since the direction of maximum stretch under \boldsymbol{D} is \boldsymbol{e_1}, the direction of maximum stretch under \boldsymbol{A} is the direction that gets mapped to \boldsymbol{e_1} under \boldsymbol{Q}^T – which is (\boldsymbol{Q}^{-1})^T\boldsymbol{e_1} or, equivalently, \boldsymbol{Qe_1}. Notice that \boldsymbol{Qe_1} is simply the first column of \boldsymbol{Q} – the first row of \boldsymbol{Q}^T.

This direction of maximum stretch is again the solution to the variance-maximization problem (3). To see this, first plug in the choice \boldsymbol{v_1}=\boldsymbol{Qe_1} to obtain

    \[ \boldsymbol{v_1}^T\boldsymbol{Av_1}=\boldsymbol{v_1}^T\boldsymbol{QDQ}^T\boldsymbol{v_1}=\boldsymbol{e_1}^T\boldsymbol{Q}^T\boldsymbol{QDQ}^T\boldsymbol{Qe_1}=\boldsymbol{e_1}^T\boldsymbol{De_1}=\lambda_1.\]

Second, for every unit vector \boldsymbol{v}, \boldsymbol{Q}^T\boldsymbol{v} is also a unit vector (since \boldsymbol{Q} is orthogonal), so \boldsymbol{v}^T\boldsymbol{QDQ}^T\boldsymbol{v} is an average of the \lambda_i’s, just as in (5) (with averaging weights given by the squared coordinates of \boldsymbol{Q}^T\boldsymbol{v}, rather than those of \boldsymbol{v}). Thus \boldsymbol{v}^T\boldsymbol{Av}\leq\lambda_1 for every unit vector \boldsymbol{v}, implying that \boldsymbol{v_1}=\boldsymbol{Qe_1} maximizes \boldsymbol{v}^T\boldsymbol{Av}.
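
We can check this on the “diagonal in disguise” from (6) with a few lines of NumPy (np.linalg.eigh returns the eigenvalues of a symmetric matrix in ascending order, so the last column is the direction of maximum stretch):

    import numpy as np

    # The matrix from (6): A = Q D Q^T with D = diag(2, 1) and Q a 45-degree rotation.
    A = np.array([[1.5, 0.5],
                  [0.5, 1.5]])

    eigenvalues, Q = np.linalg.eigh(A)   # ascending order: [1., 2.]
    v1 = Q[:, -1]                        # eigenvector for the largest eigenvalue
    print(v1)                            # proportional to (1, 1)/sqrt(2)
    print(v1 @ A @ v1)                   # 2.0 = lambda_1, the maximum of v^T A v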

2.1.5 General Covariance Matrices

We’ve seen that when the matrix \boldsymbol{A} can be written as \boldsymbol{QDQ}^T for an orthogonal matrix \boldsymbol{Q} and diagonal matrix \boldsymbol{D}, it’s easy to understand how to maximize the quadratic form in (3): the optimal solution is to set \boldsymbol{v} equal to the first row of \boldsymbol{Q}^T, and geometrically this is just the direction of maximum stretch when we view \boldsymbol{A} as a linear map from \mathbb{R}^d to itself. But we don’t want to solve the problem (3) only for diagonals in disguise – we want to solve it for an arbitrary covariance matrix \boldsymbol{A}=\boldsymbol{X}^T\boldsymbol{X}. But wait, we’ve already done this: recall from linear algebra that every matrix \boldsymbol{A} of the form \boldsymbol{X}^T\boldsymbol{X} can be written as \boldsymbol{A}=\boldsymbol{QDQ}^T for an orthogonal matrix \boldsymbol{Q} and a diagonal matrix \boldsymbol{D} as in (4).

Recall from linear algebra that for some symmetric matrices \boldsymbol{A}, the corresponding diagonal matrix \boldsymbol{D} will have some negative entries. For symmetric matrices of the form \boldsymbol{A}=\boldsymbol{X}^T\boldsymbol{X}, however, all of the diagonal entries are nonnegative. Such matrices are called “positive semidefinite”. To see this fact, note that:

  • if \boldsymbol{A}=\boldsymbol{X}^T\boldsymbol{X}, then \boldsymbol{v}^T\boldsymbol{Av}=(\boldsymbol{Xv})^T\boldsymbol{Xv}\geq 0 for every \boldsymbol{v}
  • if \boldsymbol{A}=\boldsymbol{QDQ}^T with the ith diagonal entry of \boldsymbol{D} negative, then taking \boldsymbol{v}=\boldsymbol{Qe_i} provides a vector with \boldsymbol{v}^T\boldsymbol{Av}< 0.

To summarize, so far we’ve shown the following:

When k=1, the solution to (2) is the first row of \boldsymbol{Q}^T, where \boldsymbol{X}^T\boldsymbol{X}=\boldsymbol{QDQ}^T with \boldsymbol{Q} orthogonal and \boldsymbol{D} diagonal with sorted diagonal entries.

2.1.6 Solving for Larger Values of \boldsymbol{k}

Solving for general k>1 is not much different from solving for k=1. The solution to the variance-maximization problem in (1) is analogous, namely to pick the k orthogonal axes that are stretched the most by \boldsymbol{A}. The extension of the derivation above gives the following generalization:

For general k, the solution to (1) is the first k rows of \boldsymbol{Q}^T, where \boldsymbol{X}^T\boldsymbol{X}=\boldsymbol{QDQ}^T with \boldsymbol{Q} orthogonal and \boldsymbol{D} diagonal with sorted diagonal entries.

Note that since \boldsymbol{Q} is orthogonal, the first k rows of \boldsymbol{Q}^T are indeed a set of k orthonormal vectors, as required in (1). Another way to arrive at the same k vectors is to choose \boldsymbol{v_1} as the variance-maximizing direction (as in (2)); choose \boldsymbol{v_2} as the variance-maximizing direction orthogonal to \boldsymbol{v_1}; choose \boldsymbol{v_3} as the variance-maximizing direction orthogonal to both \boldsymbol{v_1} and \boldsymbol{v_2}; and so on. For a matrix \boldsymbol{A}=\boldsymbol{X}^T\boldsymbol{X}, these are called the top k principal components of \boldsymbol{X}.

2.1.7 Connection with Eigenvectors and Eigenvalues

Now that we’ve characterized the top k principal components as the first k rows of \boldsymbol{Q}^T in the decomposition \boldsymbol{X}^T\boldsymbol{X}=\boldsymbol{QDQ}^T, an obvious question arises: how do we compute them? The answer follows from reinterpreting these k vectors as eigenvectors of the covariance matrix \boldsymbol{X}^T\boldsymbol{X}.

Recall from linear algebra that an eigenvector of a matrix \boldsymbol{A} is a vector \boldsymbol{v} that is stretched in the same direction by \boldsymbol{A}, meaning \boldsymbol{Av}=\lambda\boldsymbol{v} for some \lambda\in\mathbb{R}, where the value \lambda is the corresponding eigenvalue. Eigenvectors are just the “axes of stretch” of the geometric discussions above, with eigenvalues giving the stretch factors.

When we write \boldsymbol{A}=\boldsymbol{X}^T\boldsymbol{X} as \boldsymbol{A}=\boldsymbol{QDQ}^T, we’re actually writing the matrix in terms of its eigenvectors and eigenvalues – the rows of \boldsymbol{Q}^T are the eigenvectors of \boldsymbol{A}, and the diagonal entries of \boldsymbol{D} are the corresponding eigenvalues. To see this, consider the ith row of \boldsymbol{Q}^T, which can be written \boldsymbol{Qe_i}. Since \boldsymbol{Q}^T\boldsymbol{Q}=\boldsymbol{QQ}^T=\boldsymbol{I}, we have \boldsymbol{AQe_i}=\boldsymbol{QDQ}^T\boldsymbol{Qe_i}=\boldsymbol{QDe_i}=\lambda_i\boldsymbol{Qe_i}. Hence the ith row of \boldsymbol{Q}^T is an eigenvector of \boldsymbol{A} with eigenvalue \lambda_i. Thus, the following sentence is often given as a definition of the PCA method:

PCA boils down to computing the k eigenvectors of the covariance matrix \boldsymbol{X}^T\boldsymbol{X} that have the largest eigenvalues.

Now that we’ve redefined the top k principal components as eigenvectors, one begins to wonder: are the top k principal components of \boldsymbol{X} uniquely defined? To answer this, note that the set of diagonal entries of the matrix \boldsymbol{D} (the set of eigenvalues of \boldsymbol{X}^T\boldsymbol{X}) is uniquely defined, since we order the coordinates so that these entries appear in decreasing order. So:

  • If these eigenvalues are all distinct, then the matrix \boldsymbol{Q} is unique, up to a sign flip in each column: if \boldsymbol{A}=\boldsymbol{QDQ}^T and \boldsymbol{\hat{Q}} is \boldsymbol{Q} with the signs of some columns flipped, then \boldsymbol{A}=\boldsymbol{\hat{Q}D\hat{Q}}^T as well.
  • If an eigenvalue occurs with multiplicity m, then there is an m-dimensional subspace of corresponding eigenvectors, and any orthonormal basis of this subspace can be chosen as the corresponding principal components.
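
Putting Section 2.1 together, here is a minimal NumPy sketch of PCA via an explicit eigen-decomposition of the covariance matrix (the function name and test data are mine; this direct approach is fine for small d, but it is not how large-scale implementations work):

    import numpy as np

    def top_k_principal_components(X, k):
        """Return the k eigenvectors of X^T X with the largest eigenvalues,
        as the columns of a d x k matrix."""
        X = X - X.mean(axis=0)                          # center the data
        A = X.T @ X                                     # covariance matrix (up to a constant)
        eigenvalues, eigenvectors = np.linalg.eigh(A)   # ascending eigenvalues
        order = np.argsort(eigenvalues)[::-1]           # re-sort in decreasing order
        return eigenvectors[:, order[:k]]

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    V = top_k_principal_components(X, 2)                # 5 x 2
    print(np.round(V.T @ V, 8))                         # identity: the components are orthonormal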

2.2 Using the Power Iteration Method

We first describe how to use the power iteration method to find the first eigenvector (the one corresponding to the largest eigenvalue), then explain how to use it to find the remaining ones.

2.2.1 The Algorithm

To understand the geometric intuition behind the method, recall that if one views the matrix \boldsymbol{A}=\boldsymbol{X}^T\boldsymbol{X} as a function that maps the unit sphere to an ellipsoid, then the longest axis of the ellipsoid corresponds to the top eigenvector of \boldsymbol{A} (see Figure 5 and Figure 6 for a visual representation). Given that the top eigenvector corresponds to the direction in which multiplication by \boldsymbol{A} stretches the vector the most, it is natural to hope that if we start with a random vector \boldsymbol{v}, and keep applying \boldsymbol{A} over and over, then we will end up having stretched \boldsymbol{v} so much in the direction of \boldsymbol{A}’s top eigenvector that the image of \boldsymbol{v} will lie almost entirely in this same direction.

For instance, in Figure 6, applying \boldsymbol{A} twice (rotate/stretch/rotate-back/rotate/stretch/rotate-back) will stretch the ellipsoid twice as far along the southwest-northeast direction (the direction of the first eigenvector). Further applications of \boldsymbol{A} will make this axis of stretch even more pronounced. Eventually, almost all of the points on the original unit circle get mapped to points that are very close to this axis, as displayed in Figure 7 below.

Figure 7: Here we’ve applied \boldsymbol{A} five times to the unit circle resulting with the dashed circle. Applying \boldsymbol{A} more times will align all points of the unit circle along the top eigenvector of \boldsymbol{A}.

The following is a formal statement of the power iteration method:

Algorithm 1 (Power Iteration). Given the matrix \boldsymbol{A}=\boldsymbol{X}^T\boldsymbol{X}:

  • Select a random unit vector \boldsymbol{u_0}.
  • For i=1, 2, \dots, set \boldsymbol{u_i}=\boldsymbol{A}^i\boldsymbol{u_0}. If \boldsymbol{u_i}/\|\boldsymbol{u_i}\|\approx\boldsymbol{u_{i-1}}/\| \boldsymbol{u_{i-1}}\|, then return \boldsymbol{u_i}/\|\boldsymbol{u_i}\|.

Do note that one often uses repeated squaring – computing \boldsymbol{A}, \boldsymbol{A}^2, \boldsymbol{A}^4, \dots and applying them to \boldsymbol{u_0} to obtain \boldsymbol{Au_0}, \boldsymbol{A}^2\boldsymbol{u_0}, \boldsymbol{A}^4\boldsymbol{u_0}, \dots – which replaces a large number of matrix-vector multiplications with a smaller number of matrix-matrix multiplications.
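
Here is a minimal Python sketch of Algorithm 1 (without the repeated-squaring optimization; the tolerance and iteration cap are arbitrary choices of mine):

    import numpy as np

    def power_iteration(A, tol=1e-10, max_iter=1000, seed=0):
        """Repeatedly apply A to a random unit vector and renormalize,
        stopping once the direction no longer changes."""
        rng = np.random.default_rng(seed)
        u = rng.normal(size=A.shape[0])
        u /= np.linalg.norm(u)
        for _ in range(max_iter):
            w = A @ u                    # one application of A
            w /= np.linalg.norm(w)
            converged = np.linalg.norm(w - u) < tol
            u = w
            if converged:
                break
        return u

    A = np.array([[1.5, 0.5],
                  [0.5, 1.5]])
    print(power_iteration(A))            # approximately (1, 1)/sqrt(2), up to sign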

2.2.2 The Analysis

To show that the power iteration method works, we first prove that if \boldsymbol{A}=\boldsymbol{QDQ}^T, then \boldsymbol{A}^i=\boldsymbol{QD}^i\boldsymbol{Q}^T. This means that \boldsymbol{A}^i has the same orientation (i.e. eigenvectors) as \boldsymbol{A}, but with all the eigenvalues raised to the power i (and hence exaggerated). For example, if \lambda_1> 2\lambda_2, then \lambda_1^{10}> 1024\lambda_2^{10}.

Claim 1 If \boldsymbol{A}=\boldsymbol{QDQ}^T, then \boldsymbol{A^i}=\boldsymbol{QD}^i\boldsymbol{Q}^T.

Proof of Claim 1: We prove this by induction on i. The base case i=1 is immediate. Now, assuming that the statement holds for i=k with k\geq 1, let’s check it for i=k+1:

    \[ \boldsymbol{A}^{k+1}=\boldsymbol{A}^k.\boldsymbol{A}=\boldsymbol{QD}^k\boldsymbol{Q}^T\boldsymbol{QDQ}^T=\boldsymbol{QD}^k\boldsymbol{DQ}^T=\boldsymbol{QD}^{k+1}\boldsymbol{Q}^T. \]

The third equality uses the orthogonality of \boldsymbol{Q} (i.e. \boldsymbol{Q}^T\boldsymbol{Q}=\boldsymbol{I}), and thus the induction step holds. \blacksquare

Now, we can quantify the performance of the power iteration algorithm using the following theorem:

Theorem 1 With probability at least \frac{1}{2} over the choice of \boldsymbol{u_0}, for every t\geq 1:

    \[ \left|\left\langle \frac{\boldsymbol{A}^t\boldsymbol{u_0}}{\|\boldsymbol{A}^t\boldsymbol{u_0}\|},\boldsymbol{v_1} \right\rangle\right|\geq 1-2\sqrt{d}\left(\frac{\lambda_2}{\lambda_1}\right)^t, \]

where \boldsymbol{v_1} is the top eigenvector of \boldsymbol{A} with eigenvalue \lambda_1 and \lambda_2 is the second-largest eigenvalue of \boldsymbol{A}.

This result shows that the number of iterations required scales as \frac{\log d}{\log(\lambda_1/\lambda_2)}. The ratio \lambda_1/\lambda_2 is an important parameter called the spectral gap of the matrix \boldsymbol{A}. The bigger the spectral gap, the more pronounced the direction of maximum stretch is (compared to the other axes of stretch). If the spectral gap is large, then we are in excellent shape. If \lambda_1\approx\lambda_2, then the algorithm might take a long time to find \boldsymbol{v_1} (or might never find it). If \lambda_1=\lambda_2, then \boldsymbol{v_1} and \boldsymbol{v_2} are not uniquely defined – there is a two-dimensional subspace of eigenvectors with this eigenvalue. In this case, the power iteration algorithm will simply return a vector that lies in this subspace, which is the correct thing to do.

For example, if t> 10\frac{\log d}{\log (\lambda_1/\lambda_2)}, then |\langle \frac{\boldsymbol{u_t}}{\|\boldsymbol{u_t}\|},\boldsymbol{v_1}\rangle|> 1-2\sqrt{d}\cdot d^{-10}> 0.9999 for any d\geq 3, and so \boldsymbol{u_t} points in essentially the same direction as \boldsymbol{v_1}.

The “with probability 1/2” statement in Theorem 1 can be strengthened to “with probability at least 1-1/2^{100}” by repeating the above algorithm 100 times (for independent choices of \boldsymbol{u_0}), and outputting the recovered unit-length vector \boldsymbol{u} that maximizes \|\boldsymbol{Au}\|.

Proof of Theorem 1: Let \boldsymbol{v_1}, \boldsymbol{v_2}, \dots, \boldsymbol{v_d} denote the eigenvectors of \boldsymbol{A}, with associated eigenvalues \lambda_1\geq\lambda_2\geq\dots\geq\lambda_d. These vectors form an orthonormal basis of \mathbb{R}^d. Write the random initial vector in this basis as \boldsymbol{u_0}=\sum_{j=1}^d{c_j\boldsymbol{v_j}}. We claim that, with probability at least 1/2, |c_1|\geq \frac{1}{2\sqrt{d}}. This follows from a straightforward computation, using the fact that we can choose a random unit vector by selecting each coordinate independently from a Gaussian with variance 1/d and then normalizing (by a factor that will be very close to 1).

Given that |c_1|\geq \frac{1}{2\sqrt{d}}, we have the following:

    \[ \left|\left\langle \frac{\boldsymbol{A}^t\boldsymbol{u_0}}{\|\boldsymbol{A}^t\boldsymbol{u_0}\|},\boldsymbol{v_1} \right\rangle\right|=\frac{|c_1|\lambda_1^t}{\sqrt{\sum_{i=1}^d{(\lambda_i^tc_i)^2}}}\geq\frac{|c_1|\lambda_1^t}{\sqrt{c_1^2\lambda_1^{2t}+\lambda_2^{2t}}}\geq\frac{|c_1|\lambda_1^t}{|c_1|\lambda_1^t+\lambda_2^t}\geq 1-2\sqrt{d}\left(\frac{\lambda_2}{\lambda_1}\right)^t, \]

where the second-to-last inequality follows from the fact that \sqrt{a^2+b^2}\leq a+b for any a,b\geq 0, and the last inequality follows from dividing the numerator and denominator by |c_1|\lambda_1^t, noting that 1/(1+x)\geq 1-x for x> 0, and using the assumption that |c_1|\geq \frac{1}{2\sqrt{d}}. \blacksquare

2.2.3 Computing the Remaining Principal Components

Applying the power iteration algorithm to the covariance matrix \boldsymbol{X}^T\boldsymbol{X} of a data matrix \boldsymbol{X} finds (a close approximation to) the top principal component of \boldsymbol{X}. We can reuse the same method to compute subsequent principal components one-by-one, up to the desired number k. Specifically, to find the top k principal components:

  • Find the top component, \boldsymbol{v_1}, using power iteration.
  • Project the data matrix orthogonally to \boldsymbol{v_1}:

    \[ \left[ {\begin{array}{ccc} \boldsymbol{-} & \boldsymbol{x_1} & \boldsymbol{-} \\ \boldsymbol{-} & \boldsymbol{x_2} & \boldsymbol{-} \\ & \vdots & \\ \boldsymbol{-} & \boldsymbol{x_n} & \boldsymbol{-} \end{array} } \right] \mapsto \left[ {\begin{array}{ccc} \boldsymbol{-} & (\boldsymbol{x_1} \boldsymbol{-} \langle \boldsymbol{x_1},\boldsymbol{v_1}\rangle \boldsymbol{v_1}) & \boldsymbol{-} \\ \boldsymbol{-} & (\boldsymbol{x_2} \boldsymbol{-} \langle \boldsymbol{x_2},\boldsymbol{v_1}\rangle \boldsymbol{v_1}) & \boldsymbol{-} \\ & \vdots & \\ \boldsymbol{-} & (\boldsymbol{x_n} \boldsymbol{-} \langle \boldsymbol{x_n},\boldsymbol{v_1}\rangle \boldsymbol{v_1}) & \boldsymbol{-} \end{array} } \right] .\]

This corresponds to subtracting out the variance of the data that is already explained by the first principal component \boldsymbol{v_1}.

  • Recurse by finding the top k-1 principal components of the new data matrix.

The correctness of this greedy algorithm follows from the fact that the k-dimensional subspace that maximizes the norms of the projections of a data matrix \boldsymbol{X} contains the (k-1)-dimensional subspace that maximizes the norms of the projections.
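
A minimal NumPy sketch of this greedy procedure (the function name and test data are mine; it reuses the plain power iteration from Section 2.2.1 rather than repeated squaring):

    import numpy as np

    def top_k_components_power(X, k, tol=1e-10, max_iter=1000, seed=0):
        """Find the top component with power iteration, project it out of the
        data, and recurse, k times in total. Returns a k x d matrix whose rows
        are the recovered principal components."""
        rng = np.random.default_rng(seed)
        X = X - X.mean(axis=0)
        components = []
        for _ in range(k):
            A = X.T @ X
            u = rng.normal(size=X.shape[1])
            u /= np.linalg.norm(u)
            for _ in range(max_iter):
                w = A @ u
                w /= np.linalg.norm(w)
                converged = np.linalg.norm(w - u) < tol
                u = w
                if converged:
                    break
            components.append(u)
            X = X - np.outer(X @ u, u)   # remove each row's projection onto u
        return np.array(components)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 5))
    V = top_k_components_power(X, 2)
    print(np.round(V @ V.T, 6))          # close to the 2 x 2 identity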

2.2.4 But How Does One Choose \boldsymbol{k}?

How do you know how many principal components are “enough”? For data visualization, often you just want the first few. In other applications, like compression, the simple answer is that you don’t know in advance. In general, it is worth computing a lot of them and plotting their eigenvalues. Often the eigenvalues become small after a certain point – e.g., your data might have 200 dimensions, but after the first 50 eigenvalues, the rest are all tiny. Looking at this plot may give you a heuristic sense of how many components to keep so as to capture most of the signal in your data while keeping the dimensionality low.
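
For instance, the “plot the eigenvalues and look for where they flatten out” heuristic can be sketched as follows (the synthetic low-rank-plus-noise data is my own construction, chosen so that the plot shows a clear elbow):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    # 200-dimensional data with roughly 20 strong directions plus small noise.
    X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 200))
    X += 0.1 * rng.normal(size=(500, 200))
    X = X - X.mean(axis=0)

    eigenvalues = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]   # decreasing order

    plt.plot(eigenvalues, marker=".")
    plt.xlabel("component index")
    plt.ylabel("eigenvalue of $X^T X$")
    plt.title("Eigenvalues flatten out after roughly 20 components")
    plt.show()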

2.3 Using The Singular Value Decomposition Method

In the previous section we discussed how to compute the top k principal components using the Power Iteration method, which computes k vectors \boldsymbol{v_1}, \dots , \boldsymbol{v_k} \in \mathbb{R}^n that closely approximate the eigenvectors associated with the k largest eigenvalues of the matrix \boldsymbol{A}=\boldsymbol{X}^T\boldsymbol{X}, viewed as a linear map from \mathbb{R}^n to itself.

In this section we will focus on another algorithm called The Singular Value Decomposition (SVD). As we will see later on in this section, SVD expresses the matrix \boldsymbol{A} as a list of its ingredients, ordered by “importance”, from which we will only keep the k most important ingredients.

2.3.1 The Definitions

We’ll start with the formal definitions, and then discuss interpretations, applications, and connections to the Power Iteration Method. A singular value decomposition (SVD) of an m \times n matrix \boldsymbol{A} expresses the matrix as the product of three “simple” matrices:

(8)   \begin{equation*} \boldsymbol{A}=\boldsymbol{USV}^T, \end{equation*}

where:

  • \boldsymbol{U} is an m \times m orthogonal matrix;
  • \boldsymbol{V} is an n \times n orthogonal matrix;
  • \boldsymbol{S} is an m \times n diagonal matrix with nonnegative entries, and with the diagonal entries sorted from high to low.

Note that in contrast to the decomposition \boldsymbol{A}=\boldsymbol{QDQ}^T discussed earlier in this chapter (for matrices of the form \boldsymbol{X}^T\boldsymbol{X}), the orthogonal matrices \boldsymbol{U} and \boldsymbol{V} are not the same – since \boldsymbol{A} need not be square, \boldsymbol{U} and \boldsymbol{V} need not even have the same dimensions. Do note that even small numerical examples are tedious to work out in detail – the orthogonality constraint on the singular vectors ensures that most of the numbers are messy. The easiest way to get a feel for what SVDs look like is to feed a few small matrices into the SVD subroutine supported by your favorite environment (Matlab, Python’s NumPy library, etc.).
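
For example, the following snippet (with a small matrix chosen by me) computes an SVD with NumPy and reassembles the factorization:

    import numpy as np

    A = np.array([[3.0, 2.0, 2.0],
                  [2.0, 3.0, -2.0]])     # a small 2 x 3 example

    U, s, Vt = np.linalg.svd(A)          # s holds the singular values, sorted high to low
    print(U.shape, s.shape, Vt.shape)    # (2, 2) (2,) (3, 3)
    print(s)                             # [5. 3.]

    # Rebuild A = U S V^T by placing s on the diagonal of an m x n matrix S.
    S = np.zeros(A.shape)
    np.fill_diagonal(S, s)
    print(np.allclose(U @ S @ Vt, A))    # True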

The columns of \boldsymbol{U} are called the left singular vectors of \boldsymbol{A} (these are m-vectors). The columns of \boldsymbol{V} (that is, the rows of \boldsymbol{V}^T) are the right singular vectors of \boldsymbol{A} (these are n-vectors). The entries of \boldsymbol{S} are the singular values of \boldsymbol{A}. Thus each singular vector (left or right) has an associated singular value. The “first” or “top” singular vector refers to the one associated with the largest singular value, and so on, as shown visually in Figure 8 below.

Figure 8: The singular value decomposition (SVD). Each singular value in \boldsymbol{S} has an associated left singular vector in \boldsymbol{U}, and right singular vector in \boldsymbol{V}.

To better see how the SVD expresses \boldsymbol{A} as a “list of its ingredients”, check that the factorization \boldsymbol{A} = \boldsymbol{USV}^T is equivalent to the expression

(9)   \begin{equation*} \boldsymbol{A} = \sum_{i=1}^{\textrm{min}(m,n)}s_i.\boldsymbol{u_iv_i}^T, \end{equation*}

where s_i is the ith singular value and \boldsymbol{u_i}, \boldsymbol{v_i} are the corresponding left and right singular vectors. That is, the SVD expresses \boldsymbol{A} as a nonnegative linear combination of \min\{m,n\} rank-1 matrices, with the singular values providing the multipliers and the outer products of the left and right singular vectors providing the rank-1 matrices.

Every matrix \boldsymbol{A} has an SVD. The proof is not deep, but is better covered in a linear algebra course than here. Geometrically, thinking of an m \times n matrix as a mapping from \mathbb{R}^n to \mathbb{R}^m, this fact is kind of amazing: every matrix \boldsymbol{A}, no matter how weird, is only performing a rotation in the domain (multiplication by \boldsymbol{V}^T), followed by scaling plus adding or deleting dimensions (multiplication by \boldsymbol{S}) as needed, followed by a rotation in the range (multiplication by \boldsymbol{U}). Along the lines of the uniqueness discussion in Section 2.1.7, the SVD is “more or less unique”. The singular values of a matrix are unique. When a singular value appears multiple times, the subspaces spanned by the corresponding left and right singular vectors are uniquely defined, but arbitrary orthonormal bases can be chosen for each. Also, do note that one can always multiply the ith left and right singular vectors by -1 to get another SVD.

2.3.2 Methods of Computing the SVD

There are pretty good algorithms for computing the SVD of a matrix; the details are covered in any numerical analysis course. One simple method worth mentioning finds the eigenvalues and eigenvectors of \boldsymbol{AA}^T and \boldsymbol{A}^T\boldsymbol{A} to calculate the SVD of \boldsymbol{A}: the eigenvectors of \boldsymbol{A}^T\boldsymbol{A} make up the columns of \boldsymbol{V}, the eigenvectors of \boldsymbol{AA}^T make up the columns of \boldsymbol{U}, and the singular values in \boldsymbol{S} are the square roots of the (shared nonzero) eigenvalues of \boldsymbol{AA}^T and \boldsymbol{A}^T\boldsymbol{A}. The singular values are the diagonal entries of \boldsymbol{S} and are arranged in descending order (as mentioned before); they are always real numbers, and if the matrix \boldsymbol{A} is real, then \boldsymbol{U} and \boldsymbol{V} can be taken to be real as well. The connection underlying this method is explained further in Section 2.3.6.
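
Here is a sketch of that idea (to keep the signs of the two factors consistent, this version recovers each left singular vector as \boldsymbol{Av_i}/s_i instead of running a second, independent eigen-decomposition of \boldsymbol{AA}^T; the function name is mine, and the output is the compact form of the SVD):

    import numpy as np

    def svd_via_eigendecomposition(A):
        """Compute V and the singular values from the eigen-decomposition of
        A^T A, then recover the left singular vectors as u_i = A v_i / s_i."""
        eigenvalues, V = np.linalg.eigh(A.T @ A)         # ascending order
        order = np.argsort(eigenvalues)[::-1]            # decreasing order
        eigenvalues, V = eigenvalues[order], V[:, order]
        s = np.sqrt(np.clip(eigenvalues, 0.0, None))     # singular values
        r = int(np.sum(s > 1e-12))                       # numerical rank
        U = (A @ V[:, :r]) / s[:r]                       # columns are A v_i / s_i
        return U, s[:r], V[:, :r].T

    A = np.array([[3.0, 2.0, 2.0],
                  [2.0, 3.0, -2.0]])
    U, s, Vt = svd_via_eigendecomposition(A)
    print(s)                                             # [5. 3.]
    print(np.allclose(U @ np.diag(s) @ Vt, A))           # True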

2.3.3 Low-Rank Approximations from the SVD

Recalling that the SVD expresses a matrix \boldsymbol{A} as a sum of rank-1 matrices (weighted by the corresponding singular values), a natural idea is to keep only the first k terms on the right-hand side of (9). That is, for \boldsymbol{A} as in (9) and a target rank k, the proposed rank-k approximation is

(10)   \begin{equation*} \boldsymbol{\hat{A}}=\sum_{i=1}^ks_i.\boldsymbol{u_iv_i}^T, \end{equation*}

where as usual we assume that the singular values have been sorted (s_1\geq s_2\geq\dots\geq s_{\min\{m,n\}}\geq 0), and \boldsymbol{u_i} and \boldsymbol{v_i} denote the ith left and right singular vectors. As the sum of k rank-1 matrices, \boldsymbol{\hat{A}} clearly has rank (at most) k.

Here is an equivalent way to think about the proposed rank-k approximation (see also Figure 9 below).

Figure 9: Low rank approximation via SVD. Recall that \boldsymbol{S} is non-zero only on its diagonal, and the diagonal entries of \boldsymbol{S} are sorted from high to low. Our low rank approximation is \boldsymbol{A_k}=\boldsymbol{U_kS_kV_k}^T .
  1. Compute the SVD \boldsymbol{A}=\boldsymbol{USV}^T, where \boldsymbol{U} is an m\times m orthogonal matrix, \boldsymbol{S} is a nonnegative m\times n diagonal matrix with diagonal entries sorted from high to low, and \boldsymbol{V}^T is an n\times n orthogonal matrix.
  2. Keep only the top k right singular vectors: set \boldsymbol{V_k}^T equal to the first k rows of \boldsymbol{V}^T (a k\times n matrix).
  3. Keep only the top k left singular vectors: set \boldsymbol{U_k} equal to the first k columns of \boldsymbol{U} (an m\times k matrix).
  4. Keep only the top k singular values: set \boldsymbol{S_k} equal to the first k rows and columns of \boldsymbol{S} (a k\times k matrix), corresponding to the k largest singular values of \boldsymbol{A}.
  5. The rank-k approximation is then

(11)   \begin{equation*} \boldsymbol{A_k}=\boldsymbol{U_kS_kV_k}^T \end{equation*}

Storing the matrices on the right-hand side of (11) takes O(k(m+n)) space, in contrast to the O(mn) space required to store the original matrix \boldsymbol{A}. This is a big win when k is relatively small and m and n are relatively large (as in many applications).
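
A minimal sketch of building \boldsymbol{A_k} with NumPy (random test matrix of my own choosing), showing that the Frobenius error shrinks as k grows:

    import numpy as np

    def rank_k_approximation(A, k):
        """Sketch of (11): keep only the top k singular values and vectors."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 30))
    for k in (1, 5, 10, 30):
        A_k = rank_k_approximation(A, k)
        print(k, np.linalg.norm(A - A_k))    # Frobenius error, decreasing in k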

It is natural to interpret (11) as approximating the raw data \boldsymbol{A} in terms of k “concepts” (e.g. “math”, “music”, and “sports”). The singular values in \boldsymbol{S_k} express the signal strengths of these concepts; the rows of \boldsymbol{V}^T and columns of \boldsymbol{U} express the “canonical row/column” associated with each concept (e.g. a customer that likes only music products, or a product liked only by music customers); and the rows of \boldsymbol{U} (respectively, columns of \boldsymbol{V}^T) approximately express each row (respectively, column) of \boldsymbol{A} as a linear combination (scaled by \boldsymbol{S_k}) of the canonical rows (respectively, canonical columns).

Conceptually, this method of producing a low-rank approximation is as clean as could be imagined: we re-represent \boldsymbol{A} using the SVD, which provides a list of \boldsymbol{A}’s “ingredients,” ordered by “importance”, and we retain only the k most important ingredients. But is the result of this elegant computation any good?

The next fact justifies this approach: this low-rank approximation is optimal in a natural sense. The guarantee is in terms of the “Frobenius norm” of a matrix \boldsymbol{M}, which just means applying the \ell_2 norm to the matrix as if it were a vector: \|\boldsymbol{M}\|_F=\sqrt{\sum_{i,j}m^2_{i,j}}.

Fact 1 For every m\times n matrix \boldsymbol{A}, rank target k\geq 1, and rank-k m\times n matrix \boldsymbol{B},

    \[ \|\boldsymbol{A}-\boldsymbol{A_k}\|_F\leq \|\boldsymbol{A}-\boldsymbol{B}\|_F, \]

where \boldsymbol{A_k} is the rank-k approximation (11) derived from the SVD of \boldsymbol{A}.

We won’t prove Fact 1 formally, but a plausibility argument, based on the properties we’ve already established about the closely related PCA method, is given in Section 2.3.6 below.

2.3.4 How to Choose \boldsymbol{k}

When producing a low-rank matrix approximation, we’ve been taking as a parameter the target rank k. But how should k be chosen? In a perfect world, the singular values of \boldsymbol{A} give strong guidance: if the top few such values are big and the rest are small, then the obvious solution is to take k equal to the number of big values. In a less perfect world, one takes k as small as possible subject to obtaining a useful approximation – of course what “useful” means depends on the application. Rules of thumb often take the form: choose k such that the sum of the top k singular values is at least c times as big as the sum of the other singular values, where c is a domain-dependent constant (like 10, say).
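
One possible way to code up this rule of thumb (with c and the synthetic low-rank-plus-noise test data as my own arbitrary choices):

    import numpy as np

    def smallest_sufficient_k(singular_values, c=10.0):
        """Smallest k whose top-k singular values sum to at least c times
        the sum of the remaining singular values."""
        s = np.sort(singular_values)[::-1]
        for k in range(1, len(s) + 1):
            if s[:k].sum() >= c * s[k:].sum():
                return k
        return len(s)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 15)) @ rng.normal(size=(15, 100))
    X += 0.05 * rng.normal(size=(300, 100))
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    print(smallest_sufficient_k(s, c=10.0))  # lands near the planted rank of 15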

2.3.5 Lossy Compression via Truncated Decomposition

Using the SVD to produce low-rank matrix approximations is another example of a useful paradigm for lossy compression. The first step of the paradigm is to re-express the raw data exactly as a decomposition into several terms (as in (9)). The second step is to throw away all but the “most important” terms, yielding an approximation of the original data. This paradigm works well when you can find a representation of the data such that most of the interesting information is concentrated in just a few components of the decomposition. The appropriate representation will depend on the data set – there are a few rules of thumb, as discussed in Section 2.3.4 – and of course, messy enough data sets might not admit any nice representations at all.

2.3.6 PCA Reduces to SVD

There is an interesting relationship between the SVD and the decompositions we discussed in the last section. Recall that we used the fact that \boldsymbol{X}^T\boldsymbol{X}, as a symmetric n\times n matrix, can be written as \boldsymbol{X}^T\boldsymbol{X}=\boldsymbol{QDQ}^T, where \boldsymbol{Q} is an n\times n orthogonal matrix and \boldsymbol{D} is an n\times n diagonal matrix. (Here \boldsymbol{X} is the data matrix, with each of the m rows representing a data point in \mathbb{R}^n). Consider the SVD \boldsymbol{X}=\boldsymbol{USV}^T and what its existence means for \boldsymbol{X}^T\boldsymbol{X}:

(12)   \begin{equation*} \boldsymbol{X}^T\boldsymbol{X}=(\boldsymbol{USV}^T)^T(\boldsymbol{USV}^T)=\boldsymbol{VS}^T\boldsymbol{U}^T\boldsymbol{USV}^T=\boldsymbol{VDV}^T, \end{equation*}

where \boldsymbol{D} is a diagonal matrix with diagonal entries equal to the squares of the diagonal entries of \boldsymbol{S} (if m<n then the remaining n-m diagonal entries of \boldsymbol{D} are 0).

Recall from Section 2.1.7 that if we decompose \boldsymbol{X}^T\boldsymbol{X} as \boldsymbol{QDQ}^T, then the rows of \boldsymbol{Q}^T are the eigenvectors of \boldsymbol{X}^T\boldsymbol{X}. The computation in (12) therefore shows that the rows of \boldsymbol{V}^T are the eigenvectors of \boldsymbol{X}^T\boldsymbol{X}. Thus, the right singular vectors of \boldsymbol{X} are the same as the eigenvectors of \boldsymbol{X}^T\boldsymbol{X}. Similarly, the eigenvalues of \boldsymbol{X}^T\boldsymbol{X} are the squares of the singular values of \boldsymbol{X}.

Thus PCA reduces to computing the SVD of \boldsymbol{X} (without ever forming \boldsymbol{X}^T\boldsymbol{X}). Recall that the output of PCA, given a target k, is simply the top k eigenvectors of the covariance matrix \boldsymbol{X}^T\boldsymbol{X}. The SVD \boldsymbol{USV}^T of \boldsymbol{X} hands you these eigenvectors on a silver platter: they are simply the first k rows of \boldsymbol{V}^T. This is an alternative to the Power Iteration method discussed in the previous section. So which method for computing eigenvectors is better? There is no clear answer; in many cases, either should work fine, and if performance is critical you’ll want to experiment with both. Certainly the Power Iteration method, which finds the eigenvectors of \boldsymbol{X}^T\boldsymbol{X} one-by-one, looks like a good idea when you only want the top few eigenvectors (as in our data visualization use cases). If you want many or all of them, then the SVD – which gives you all of the eigenvectors, whether you want them or not – is probably the first thing to try.
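
The following short check (random data of my own choosing, and assuming distinct eigenvalues so that eigenvectors are unique up to sign) confirms the two claims above: the right singular vectors of \boldsymbol{X} are the eigenvectors of \boldsymbol{X}^T\boldsymbol{X}, and the eigenvalues of \boldsymbol{X}^T\boldsymbol{X} are the squared singular values of \boldsymbol{X}.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 6))
    X = X - X.mean(axis=0)

    # PCA route: eigen-decompose the covariance matrix ...
    eigenvalues, Q = np.linalg.eigh(X.T @ X)
    eigenvalues, Q = eigenvalues[::-1], Q[:, ::-1]       # re-sort in decreasing order

    # ... SVD route: decompose X itself.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    print(np.allclose(s ** 2, eigenvalues))              # eigenvalues = squared singular values
    print(np.allclose(np.abs(Vt), np.abs(Q.T)))          # same vectors, up to sign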

Now that we understand the close connection between the SVD and the PCA method, let’s return to Fact 1, which states that the SVD-based rank-k approximation is optimal (with respect to the Frobenius norm). Intuitively, this fact holds because: (i) minimizing the Frobenius norm \|\boldsymbol{A}-\boldsymbol{B}\|_F is equivalent to minimizing the average (over i) of the squared Euclidean distances between the ith rows of \boldsymbol{A} and \boldsymbol{B}; (ii) the SVD uses the same vectors to approximate the rows of \boldsymbol{A} as PCA (the top eigenvectors of \boldsymbol{A}^T\boldsymbol{A}/right singular vectors of \boldsymbol{A}); and (iii) PCA, by definition, chooses its k vectors to minimize the average squared Euclidean distance between the rows of \boldsymbol{A} and the k-dimensional subspace of linear combinations of these vectors. The contribution of a row of \boldsymbol{A}-\boldsymbol{A_k} to the Frobenius norm corresponds exactly to one of these squared Euclidean distances.

2.3.7 More on PCA vs. SVD

PCA and SVD are closely related, and in data analysis circles you should be ready for the terms to be used almost interchangeably. There are differences, however. First, PCA refers to a data analysis approach – a choice of how to define the “best” way to approximate a bunch of data points as linear combinations of a small set of vectors. Meanwhile, the SVD is a general operation defined on all matrices. For example, it doesn’t really make sense to talk about “applying PCA” to a matrix \boldsymbol{A} unless the rows of \boldsymbol{A} have clear semantics – typically, as data points \boldsymbol{x_1}, \dots, \boldsymbol{x_m} \in \mathbb{R}^n. By contrast, the SVD (8) is well defined for every matrix \boldsymbol{A}, whatever the semantics of \boldsymbol{A}. In the particular case where the rows of \boldsymbol{A} represent data points, the SVD can be interpreted as performing the calculations required by PCA (the SVD is also useful for many other computational tasks).

We can also make more of an “apples vs. apples” comparison in the following way. Let’s define the “PCA operation” as taking an m\times n data matrix \boldsymbol{X} as input, and possibly a parameter k, and outputting all (or the top k) eigenvectors of the covariance matrix \boldsymbol{X}^T\boldsymbol{X}. The “SVD operation” takes as input an m\times n matrix \boldsymbol{X} and outputs \boldsymbol{U}, \boldsymbol{S}, and \boldsymbol{V}^T, where the rows of \boldsymbol{V}^T are the eigenvectors of \boldsymbol{X}^T\boldsymbol{X}. Thus the SVD gives strictly more information than required by PCA, namely the matrix \boldsymbol{U}.

Is the additional information \boldsymbol{U} provided by the SVD useful? In applications where you want to understand the column structure of \boldsymbol{X}, in addition to the row structure, the answer is “yes.” To see this, let’s review some interpretations of the SVD (8). On the one hand, the decomposition expresses every row of \boldsymbol{X} as a linear combination of the rows of \boldsymbol{V}^T, with the rows of \boldsymbol{US} providing the coefficients of these linear combinations. That is, we can interpret the rows of \boldsymbol{X} in terms of the rows of \boldsymbol{V}^T, which is useful when the rows of \boldsymbol{V}^T have interesting semantics. Analogously, the decomposition in (8) expresses the columns of \boldsymbol{X} as linear combinations of the columns of \boldsymbol{U}, with the coefficients given by the columns of \boldsymbol{SV}^T. So when the columns of \boldsymbol{U} are interpretable, the decomposition gives us a way to understand the columns of \boldsymbol{X}.

In some applications, we really only care about understanding the rows of \boldsymbol{X}, and the extra information \boldsymbol{U} provided by the SVD over PCA is irrelevant. In other applications, both the rows and the columns of \boldsymbol{X} are interesting in their own right. For example:

  • Suppose the rows of \boldsymbol{X} are indexed by customers, and the columns by products, with the matrix entries indicating who likes what. We are interested in understanding the rows, and in the best-case scenario, the right singular vectors (rows of \boldsymbol{V}^T) are interpretable as “customer types” or “canonical customers” and the SVD expresses each customer as a mixture of customer types. For example, perhaps each customer’s purchasing history can be understood simply as a mixture of a “CS customer,” a “music customer”, and a “clothing customer.” In the ideal case, the left singular vectors (columns of \boldsymbol{U}) can be interpreted as “product types”, where the “types” are the same as for customers, and the SVD expresses each product as a mixture of product types (the extent to which a product appeals to a “CS customer”, a “music customer”, etc).
  • Suppose the matrix represents data about drug interactions, with the rows of \boldsymbol{X} indexed by proteins or pathways, and the columns by chemicals or drugs. We’re interested in understanding both proteins and drugs in their own right, as mixtures of a small set of “basic types.”

In the above two examples, what we really care about is the relationship between two groups of objects – customers and products, or proteins and drugs – and the labeling of one group as the “rows” of a matrix and the other as the “columns” is arbitrary. In such cases, you should immediately think of the SVD as a potential tool for better understanding the data. When the columns of \boldsymbol{X} are not interesting in their own right, PCA already provides the relevant information.

. . .

This concludes the second chapter in this series, which tackled in depth the mathematics behind the Power Iteration and SVD methods for computing the principal components in PCA. In the next post we’ll apply everything we’ve learned here to build a face recognition model using the PCA method. We’ll implement the model (which uses PCA and SVM) in Python, so a basic understanding of code is required for that chapter. Do note that, unlike PCA, SVM is a classification algorithm, which we will discuss in another post.

Author: EDaBeast
