Principal Component Analysis | Chapter 1: Understanding PCA

Some high-dimensional data reduced to two dimensions. Courtesy of GISGeography.

In this three-part series of posts, we will discuss a simple machine learning algorithm called Principal Component Analysis (PCA), which can be explained using only knowledge of linear algebra. The algorithm is considered simple and easy to understand because the reader only needs basic knowledge of one subfield of mathematics. The topic is divided into three chapters that are designed to be read independently, so you can read the entire series or only the individual chapters that match your needs and background. Please note that at least a high-school level of mathematical knowledge is assumed for all three posts. The three chapters are as follows:

  1. Chapter 1: Understanding Principal Component Analysis. This post is written for readers who have little to no knowledge of linear algebra.
  2. Chapter 2: The Actual Workings of PCA. This post is written for readers with knowledge of linear algebra. It rigorously explains two different mathematical methods that power PCA: the Power Iteration Method and the Singular Value Decomposition (SVD).
  3. Chapter 3: Eigenfaces using PCA. This post is written for readers with some programming experience. The chapter will discuss the eigenfaces technique, which uses PCA, and will walk through a code example written in Python.

But of course, to understand the full picture and every nook and cranny of the PCA algorithm, it is advised that the reader start at Chapter 1 and work their way up to Chapter 3.

. . .

1 Understanding Principal Component Analysis

The simplest way to understand principal component analysis (PCA) is through an example. The following example describes one of the many problems to which we might want to apply PCA. It illustrates the main reasons why one chooses to use PCA: to visualize the data in a lower-dimensional space, to understand the sources of variability in the data, and to understand the correlations (if any) between the different coordinates of the data points.

1.1 An Introductory Example

Suppose that you are given the task of interpreting the data from a survey of people’s ratings of four different foods. Each rating is on a scale of 1 to 10, with 1 being the lowest and 10 being the highest. The foods are Taco Salad, Hamburger, Pepperoni Pizza, and Strawberry Tart. The survey results are as follows:

          Taco Salad   Hamburger   Pepperoni Pizza   Strawberry Tart
John          10           1              2                 7
Rosie          7           2              1                10
Steven         2           9              7                 3
Jessie         3           6             10                 2

Let m = 4 be the number of data points (rows), and n = 4 be the dimensions (columns).

Now looking at the above table, would we be able to usefully visualize the data in less than 4 dimensions? And by doing so, would we be able to understand the forces at work behind the differences in opinion?

The key observation here is that the data points can be approximately expressed as

(1)   \begin{equation*}  \boldsymbol{\bar{x}}+c_1\boldsymbol{v_1}+c_2 \boldsymbol{v_2} , \end{equation*}

where \boldsymbol{\bar{x}}=(5.5, 4.5, 5, 5.5) is the average of the data points, \boldsymbol{v_1}=(3, -3, -3, 3), \boldsymbol{v_2}=(1, -1, 1, -1), and the coefficient pair (c_1, c_2) is (1, 1) for John, (1, -1) for Rosie, (-1, -1) for Steven, and (-1, 1) for Jessie, as shown in Figure 1.


Figure 1: Visualizing 4-dimensional data in a 2-dimensional plane.

Approximating John’s scores using the vectors and scalars above gives \boldsymbol{\bar{x}}+\boldsymbol{v_1}+\boldsymbol{v_2}=(9.5, 0.5, 3, 7.5). Similarly, approximating Rosie’s scores gives \boldsymbol{\bar{x}}+\boldsymbol{v_1}-\boldsymbol{v_2}=(7.5, 2.5, 1, 9.5), Steven’s gives \boldsymbol{\bar{x}}-\boldsymbol{v_1}-\boldsymbol{v_2}=(1.5, 8.5, 7, 3.5), and so on.
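To make the arithmetic concrete, here is a minimal numpy sketch (the variable names are ours, chosen for this illustration only) that reproduces the approximations above from the survey table:

```python
import numpy as np

# Survey scores: rows are John, Rosie, Steven, Jessie;
# columns are Taco Salad, Hamburger, Pepperoni Pizza, Strawberry Tart.
X = np.array([[10, 1,  2,  7],
              [ 7, 2,  1, 10],
              [ 2, 9,  7,  3],
              [ 3, 6, 10,  2]], dtype=float)

x_bar = X.mean(axis=0)                        # sample mean (5.5, 4.5, 5.0, 5.5)
v1 = np.array([3, -3, -3, 3], dtype=float)
v2 = np.array([1, -1,  1, -1], dtype=float)
coeffs = np.array([[ 1,  1],   # John
                   [ 1, -1],   # Rosie
                   [-1, -1],   # Steven
                   [-1,  1]],  # Jessie
                  dtype=float)

# Approximation (1): x_bar + c1*v1 + c2*v2, one row per person.
approx = x_bar + coeffs @ np.vstack([v1, v2])
print(approx)   # each row approximates the corresponding person's actual scores
```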

The approximation (1) provides a way to visualize the data points in a lower dimension, as in Figure 1. Note that if we had a larger number of data points, we would be inspecting clusters of similar data points instead of individual data points, but the idea is the same.

The approximation also helps us interpret the data. The way we do that is to think of each data point as having two coordinates, c_1 and c_2, with respect to the vectors \boldsymbol{v_1} and \boldsymbol{v_2}. Looking at \boldsymbol{v_1}, we notice that the first and fourth coordinates of the data are positively correlated, since both entries of \boldsymbol{v_1} are large positive values there. Similarly, the second and third coordinates are positively correlated, since both entries are large negative values. The first and fourth coordinates, however, are negatively correlated with the second and third, since the corresponding entries have opposite signs.

Now let us match this interpretation with the real-world survey. Noting that the taco salad and the strawberry tart are vegetarian while the hamburger and the pepperoni pizza are not, we can interpret the \boldsymbol{v_1} vector as indicative of a person’s vegetarian preference. By similar reasoning we can interpret \boldsymbol{v_2} as indicative of a person’s health-consciousness. Noting also that \boldsymbol{v_1} has the larger values, we can say that it has a stronger effect than \boldsymbol{v_2}.

In summary, in our introductory example we approximated and interpreted our 4-dimensional data in a 2-dimensional space using (1). The point of PCA is to automatically compute approximations of the form (1) for any data set, be it small or large. The vectors \boldsymbol{v_1} and \boldsymbol{v_2} are called the top two principal components, and they are what PCA computes.

1.2 Explaining the PCA Method

1.2.1 What is PCA used for?

The PCA algorithm expresses each of the m n-dimensional data vectors (approximately) as a linear combination of k n-dimensional vectors. In mathematical terms, it approximates every vector \boldsymbol{x_1},  \boldsymbol{x_2}, \dotsb ,  \boldsymbol{x_m} \in \mathbb{R}^n as a linear combination of \boldsymbol{v_1},  \boldsymbol{v_2}, \dotsb ,  \boldsymbol{v_k} \in \mathbb{R}^n as follows

    \[ x_i \approx \sum_{j=1}^k a_{ij}v_j  \]

for i = 1, 2, \dotsb , m. This compresses each n-dimensional data point down to just k coordinates (where k \leq n), hence the name given to PCA as a dimensionality reduction algorithm. In our introductory example, we have m = n = 4 (the number of data points and the dimension are equal) and k = 2 (the top two principal components).
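As a hedged sketch of what this representation looks like in code (the function name and arguments are ours, and we assume the data has already been centered and that the directions are orthonormal, both of which are discussed below), each point is reduced to its k coefficients a_{ij} = \langle \boldsymbol{x_i}, \boldsymbol{v_j} \rangle:

```python
import numpy as np

def pca_compress(X_centered, V):
    """Represent each centered point by k coefficients.

    X_centered : (m, n) array whose rows are the centered data points.
    V          : (k, n) array whose rows v_1, ..., v_k are assumed orthonormal.
    Returns the (m, k) coefficient matrix A with a_ij = <x_i, v_j>,
    and the (m, n) array of approximations sum_j a_ij * v_j.
    """
    A = X_centered @ V.T        # each point is now described by k numbers
    return A, A @ V             # reconstruction from those k numbers
```

For the survey data, V could hold the vectors \boldsymbol{v_1} and \boldsymbol{v_2} normalized to unit length, so that A contains one row of two coefficients per person.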

The vectors computed by PCA are the “best” k vectors in a precise sense: they span the “best fit” line (for k = 1), plane (for k = 2), or k-dimensional subspace (for k \geq 3). In our introductory example, the “best” vectors (i.e. \boldsymbol{v_1} and \boldsymbol{v_2}) span the best-fit 2-dimensional plane (since k = 2).

1.2.2 Before using PCA…

It is important to “preprocess” the data before using PCA to find the best-fit line(s); this makes the necessary linear algebra both simpler and clearer. The preprocessing consists of centering the points \boldsymbol{x_1},  \boldsymbol{x_2}, \dotsb ,  \boldsymbol{x_m} around the origin, meaning that their sum \sum_{i=1}^m \boldsymbol{x_i} becomes the all-zero vector. This is easily enforced by subtracting the “sample mean” \boldsymbol{\bar{x}} = \frac{1}{m}\sum_{i=1}^m \boldsymbol{x}_i (simply the mean of the sample data we have) from each point. In our introductory example the sample mean was \boldsymbol{\bar{x}}=(5.5, 4.5, 5, 5.5). The idea is to find the best-fit line(s) for the shifted point set (the data set from which we subtracted the sample mean), then shift the line(s) back by adding the sample mean, obtaining the best-fit line(s) for the original data set. Note that the best-fit line(s) are referred to as the top k principal component(s).
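A minimal sketch of the centering step, using the survey matrix from Section 1.1 (variable names are ours):

```python
import numpy as np

X = np.array([[10, 1,  2,  7],    # John
              [ 7, 2,  1, 10],    # Rosie
              [ 2, 9,  7,  3],    # Steven
              [ 3, 6, 10,  2]],   # Jessie
             dtype=float)

x_bar = X.mean(axis=0)            # sample mean, here (5.5, 4.5, 5.0, 5.5)
X_centered = X - x_bar            # subtract the sample mean from every point
print(X_centered.sum(axis=0))     # the centered points sum to the zero vector
```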

In some cases – where the units of measurement result in very uneven scaling of the axes – it is also best to “preprocess” the data by scaling the coordinates appropriately, because the result of PCA is highly sensitive to the units in which each coordinate is measured. For instance, changing the unit system from centimeters to inches yields the same underlying data, and yet this change rescales all the values along that coordinate, which in turn causes different best-fit line(s) to be computed. This should be geometrically intuitive in 2 dimensions, where scaling up either or both the x- and y-axis stretches the data points, as shown in Figure 2.

Figure 2: Scaling the x- and y-axis gives different best fit line(s).

One way to scale the coordinate system is to take the points \boldsymbol{x_1},  \boldsymbol{x_2}, \dotsb ,  \boldsymbol{x_m} already centered around the origin and, for every j = 1, 2, \dotsb , n, divide the jth coordinate of every point by the “sample deviation” in that coordinate, \sqrt{ \sum_{i=1}^m x_{ij}^2}.
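Continuing the sketch, this is how the scaling step might look; we divide each coordinate of the centered points by the factor \sqrt{\sum_{i=1}^m x_{ij}^2} from the paragraph above (the conventional sample standard deviation differs from this only by a constant factor common to all coordinates, so it leads to the same principal component directions):

```python
import numpy as np

# X_centered: the already-centered survey points from the previous sketch.
X_centered = np.array([[ 4.5, -3.5, -3.0,  1.5],
                       [ 1.5, -2.5, -4.0,  4.5],
                       [-3.5,  4.5,  2.0, -2.5],
                       [-2.5,  1.5,  5.0, -3.5]])

scale = np.sqrt((X_centered ** 2).sum(axis=0))   # per-coordinate sample deviation
X_scaled = X_centered / scale                    # coordinates now on comparable scales
```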

1.2.3 Best fitting one line (for \boldsymbol{k = 1})

As discussed above, the goal of PCA is to best fit a line or multiple lines to a data set (depending on how far we want to reduce its dimension). We’ll start off by discussing how to best fit a single line, which can then be easily generalized to best fitting multiple lines (for k > 1), discussed next.

PCA chooses the best fit line as the one that minimizes the average squared Euclidean distance between the line and the data points as follows:

(2)   \begin{equation*}   \argmin_{\boldsymbol{v}:\|\boldsymbol{v}\|=1} \frac{1}{m}\sum_{i=1}^m((\textrm{distance between }\boldsymbol{x_i}\textrm{ and the line spanned by }\boldsymbol{v})^2).  \end{equation*}

Note that the line we’re identifying passes through the origin in the same direction as the “unit vector” \boldsymbol{v}. A “unit vector” is a vector that has a magnitude (or size) of 1 unit (hence the name “unit vector”).

Now, you might be wondering why we square the Euclidean distance between each data point and the line before adding them up, or why we square them at all. The reason is the following: in linear algebra, the “inner product” – denoted \langle \boldsymbol{x_i}, \boldsymbol{v} \rangle – is the length of the projection of \boldsymbol{x_i} onto the line spanned by \boldsymbol{v}, as shown in Figure 3:

Figure 3: The geometry of the inner product.

Note that the “inner product” is essentially the same as the dot product, but defined on a general “vector space”. The inner product of two vectors can be calculated as \boldsymbol{x_i}^T\boldsymbol{v}, where taking the transpose of a vector (denoted \boldsymbol{x_i}^T) means that the vector \boldsymbol{x_i} is written as a row instead of a column, or vice versa, depending on how the vector was initially written.
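As a small illustration (the point and direction below are the centered version of John’s scores and the vector \boldsymbol{v_1} from Section 1.1, normalized to unit length), the inner product with a unit vector is exactly the quantity labeled \langle \boldsymbol{x_i}, \boldsymbol{v} \rangle in Figure 3:

```python
import numpy as np

x_i = np.array([4.5, -3.5, -3.0, 1.5])       # John's centered scores
v = np.array([3.0, -3.0, -3.0, 3.0])
v = v / np.linalg.norm(v)                    # make v a unit vector

proj_length = x_i @ v                        # inner product <x_i, v> = x_i^T v
projection = proj_length * v                 # projection of x_i onto the line spanned by v
distance = np.linalg.norm(x_i - projection)  # distance from x_i to that line
```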

Recall that the Pythagorean Theorem states the following: for a right triangle with sides a and b and hypotenuse c, a^2+b^2=c^2. Instantiating this for the right triangle shown in Figure 3 we have

(3)   \begin{equation*}(\textrm{dist}(\boldsymbol{x_i},\textrm{line}))^2+\langle \boldsymbol{x_i},\boldsymbol{v} \rangle^2=\|\boldsymbol{x_i}\|^2.\end{equation*}

The right-hand side of (3) is a constant, independent of the choice of the line \boldsymbol{v}. Thus, there is a zero-sum game between the squared distance from a point \boldsymbol{x_i} to the line spanned by \boldsymbol{v} and the squared length of the projection of \boldsymbol{x_i} onto that line: making one of these quantities bigger makes the other smaller. This implies that the “objective function” of maximizing the squared projections – that is, the “variance” (the average of the squared differences from the mean) of the projections of the point set – is equivalent to the original objective in (2):

    \[  \label{eq:linear} \argmax_{\boldsymbol{v}:\|\boldsymbol{v}\|=1} \frac{1}{m}\sum_{i=1}^m\langle \boldsymbol{x_i},\boldsymbol{v} \rangle ^2. \]
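The zero-sum game can be checked numerically. Below is a hedged sketch that evaluates both objectives – the average squared distance in (2) and the average squared projection length above – for any candidate unit vector on the centered survey points; their sum is the same for every choice of \boldsymbol{v}, which is exactly the content of (3).

```python
import numpy as np

X_centered = np.array([[ 4.5, -3.5, -3.0,  1.5],
                       [ 1.5, -2.5, -4.0,  4.5],
                       [-3.5,  4.5,  2.0, -2.5],
                       [-2.5,  1.5,  5.0, -3.5]])

def both_objectives(v):
    """Return (average squared distance to the line spanned by v,
    average squared projection length onto that line)."""
    v = v / np.linalg.norm(v)                      # work with a unit vector
    proj = X_centered @ v                          # <x_i, v> for every point
    residual = X_centered - np.outer(proj, v)      # component of each point off the line
    return (residual ** 2).sum(axis=1).mean(), (proj ** 2).mean()

for v in (np.array([3.0, -3.0, -3.0, 3.0]), np.array([1.0, 0.0, 0.0, 0.0])):
    d, p = both_objectives(v)
    print(d, p, d + p)    # d + p is the same constant for both directions
```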

The objective function of maximizing variance is natural in its own right, and the fact that PCA’s objective function admits multiple interpretations builds confidence that it’s performing a fundamental operation on the data. For example, imagine the data points fall into two well-separated clusters. See Figure 4 for an illustration, but imagine a high-dimensional version that can’t be so easily eyeballed.

Figure 4: For the good line, the projection of the points onto the line keeps the two clusters separated, while the projection onto the bad line merges the two clusters.

In this case, maximizing variance corresponds to preserving the separation between the two clusters post-projection. With a poorly chosen line, the two clusters would project on top of one another, thus obscuring the structure in the data. PCA effectively assumes that variance in the data corresponds to interesting information. One can imagine scenarios where such variance corresponds to noise rather than signal, in which case PCA may not produce illuminating results.
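To see this effect numerically, here is a hedged sketch (the clusters and the two candidate directions are made up for this illustration, in the spirit of Figure 4): projecting onto the “good” direction keeps the clusters apart and yields a much larger variance than projecting onto the “bad” one.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 2-D clusters, centered afterwards.
cluster_a = rng.normal(loc=(-5.0, 0.0), scale=0.5, size=(50, 2))
cluster_b = rng.normal(loc=( 5.0, 0.0), scale=0.5, size=(50, 2))
X = np.vstack([cluster_a, cluster_b])
X = X - X.mean(axis=0)

good = np.array([1.0, 0.0])        # direction that separates the clusters
bad  = np.array([0.0, 1.0])        # direction that merges them
for v in (good, bad):
    proj = X @ v                   # projections onto the unit vector v
    print(v, (proj ** 2).mean())   # variance of the projections
```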

1.2.4 Best fitting multiple lines (for general \boldsymbol{k})

The discussion so far focused on the k = 1 case (fitting a line to the data), but the case of larger k is almost the same. For general k, the objective functions of minimizing the squared distances to a k-dimensional subspace and of maximizing the variance of the projections onto a k-dimensional subspace (a subspace being a space wholly contained in another space, for example a line or a plane through the origin) are again equivalent. So the PCA objective is now

(4)   \begin{equation*}   \argmax_{k\textrm{-dimensional subspace }S} \frac{1}{m}\sum_{i=1}^m(\textrm{length of }\boldsymbol{x_i}\textrm{'s projection onto }S)^2. \end{equation*}

To rephrase this objective in a more convenient form, we need some basic linear algebra. First, vectors \boldsymbol{v_1}, \dotsb , \boldsymbol{v_k} are “orthonormal” if they all have unit length (\|\boldsymbol{v_i}\|=1 for all i) and are orthogonal (\langle \boldsymbol{v_i},\boldsymbol{v_j} \rangle=0 for all i \neq j). For example, the “standard basis vectors” (vectors of the form (0, 0, \dotsb , 0, 1, 0, \dotsb , 0), i.e. the unit vectors \boldsymbol{i}, \boldsymbol{j}, and \boldsymbol{k} that point along the x-, y-, and z-axes in calculus) are orthonormal. Rotating these vectors gives other sets of orthonormal vectors.

The span of a collection \boldsymbol{v_1}, \dotsb , \boldsymbol{v_k} \in \mathbb{R}^n of vectors is the set of all of their linear combinations: \{\sum_{j=1}^k \lambda_j \boldsymbol{v_j} : \lambda_1, \dotsb , \lambda_k \in \mathbb{R} \}. If k = 1, this span is a line through the origin; if k = 2 and \boldsymbol{v_1}, \boldsymbol{v_2} are linearly independent (i.e. not multiples of each other), the span is a plane (through the origin); and so on.

One nice fact about orthonormal vectors \boldsymbol{v_1}, \dotsb , \boldsymbol{v_k} is that the squared projection length of a point onto the subspace spanned by the vectors is just the sum of the squares of its projections onto each of the vectors:

(5)   \begin{equation*}(\textrm{length of }\boldsymbol{x_i}\textrm{'s projection on span}(\boldsymbol{v_1}, \dotsb , \boldsymbol{v_k}))^2=\sum_{j=1}^k\langle \boldsymbol{x_i},\boldsymbol{v_j} \rangle ^2 \end{equation*}

For intuition, consider first the case where \boldsymbol{v_1}, \dotsb , \boldsymbol{v_k} are all standard basis vectors, so \langle \boldsymbol{x_i},\boldsymbol{v_j} \rangle just picks out one coordinate of \boldsymbol{x_i}. This identity is not true if the \boldsymbol{v_j}’s are not orthonormal (e.g. imagine if k=2 and \boldsymbol{v_1}, \boldsymbol{v_2} are linearly independent but nearly the same vector).
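A quick numerical check of (5), using the orthonormal pair obtained by normalizing \boldsymbol{v_1} and \boldsymbol{v_2} from the introductory example (which happen to be orthogonal already) and John’s centered scores as the point:

```python
import numpy as np

x  = np.array([4.5, -3.5, -3.0, 1.5])                     # John's centered scores
v1 = np.array([3.0, -3.0, -3.0, 3.0]); v1 /= np.linalg.norm(v1)
v2 = np.array([1.0, -1.0,  1.0, -1.0]); v2 /= np.linalg.norm(v2)
assert abs(v1 @ v2) < 1e-12                               # the pair is orthonormal

# Left-hand side of (5): squared length of x's projection onto span(v1, v2).
projection = (x @ v1) * v1 + (x @ v2) * v2
lhs = np.linalg.norm(projection) ** 2
# Right-hand side of (5): sum of the squared projections onto each vector.
rhs = (x @ v1) ** 2 + (x @ v2) ** 2
print(lhs, rhs)                                           # the two values agree
```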

Combining (4) and (5), we can state the objective of PCA for general k in its standard form: compute orthonormal vectors \boldsymbol{v_1}, \dotsb , \boldsymbol{v_k} to maximize

(6)   \begin{equation*}\frac{1}{m}\sum_{i=1}^m \sum_{j=1}^k\langle \boldsymbol{x_i},\boldsymbol{v_j} \rangle^2. \end{equation*}

The resulting k orthonormal vectors are called the top k principal components of the data. In our introductory example, the two vectors (3, -3, -3, 3) and (1, -1, 1, -1), once normalized to be unit vectors, are close to the top two principal components. The actual principal components have messier numbers so we won’t be reporting them in this post.

To recap, the formal definition of the problem solved by PCA is: given \boldsymbol{x_1}, \dotsb , \boldsymbol{x_m} \in \mathbb{R}^n and a parameter k \geq 1, compute orthonormal vectors \boldsymbol{v_1}, \dotsb , \boldsymbol{v_k} \in \mathbb{R}^n to maximize (6).
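To close the loop, here is a hedged end-to-end sketch that computes the top k = 2 principal components of the survey data using numpy’s singular value decomposition (why SVD solves (6) is exactly the subject of Chapter 2); as noted above, the resulting components have messier numbers than \boldsymbol{v_1} and \boldsymbol{v_2}, but they should align closely with them up to sign.

```python
import numpy as np

X = np.array([[10, 1,  2,  7],
              [ 7, 2,  1, 10],
              [ 2, 9,  7,  3],
              [ 3, 6, 10,  2]], dtype=float)
X_centered = X - X.mean(axis=0)                 # preprocessing: center the data

# The rows of Vt are orthonormal; its first k rows maximize objective (6).
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
k = 2
top_k = Vt[:k]                                  # top k principal components

# Compare (up to sign) against the normalized v1 and v2 from Section 1.1.
v1 = np.array([3.0, -3.0, -3.0, 3.0]) / 6.0
v2 = np.array([1.0, -1.0,  1.0, -1.0]) / 2.0
print(np.abs(top_k @ np.vstack([v1, v2]).T))    # large diagonal entries indicate close alignment
```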

. . .

This concludes the first chapter in this series. We started off with a simple yet practical example to show how PCA reduces the dimensionality of the data using representative vectors called the top principal components. We then explained how PCA actually works, beginning with the preprocessing step, which consists of centering the data points around the origin by subtracting the sample mean (and, if needed, rescaling the coordinates). Next we discussed how to find the first principal component: the best-fit line is the one that minimizes the average squared Euclidean distance between the line and the data points, and we saw that this minimization is equivalent to maximizing the average squared projection. Finally, we generalized this idea to finding the top k principal components.

In the next post, we will rigorously explain how the principal components are actually computed, using two different methods: the Power Iteration Method and the Singular Value Decomposition (SVD).

Author: EDaBeast
