Linear Algebra Notes

Notes on some linear algebra topics. Sometimes I find things are just stated as being obvious, for example, the dot product of two orthogonal vectors is zero. Well... why is this? The result was pretty simple but nonetheless it took a little while to think it through properly. Hence, these notes... they started off as my musings on the "why", even if the "why" might be obvious to the rest of the world!

A lot of these notes are now just based on the amazing Khan Academy linear algebra tutorials. In fact, I'd say the vast majority are... they are literally just notes on Sal's lectures!

Another resource that is really worth watching is 3Blue1Brown's series "The Essence of Linear Algebra".


Vector Stuff

Vector Spaces

References:

A vector is an element of a vector space. A vector space, also known as a linear space, V , is a set which has addition and scalar multiplication defined for it and satisfies the following for any vectors \vec u , \vec v , and \vec w and scalars c and d :

1 Closed under addition: \vec u + \vec v \in V
2 Commutative: \vec u + \vec v = \vec v + \vec u
3 Associative: (\vec u + \vec v) + \vec w = \vec u + (\vec v + \vec w)
4 Additive identity: \vec 0 \in V, \text{i.e.,} \vec u + \vec 0 = \vec u
5 Inverse: \forall \vec u \in V, \exists -\vec u | \vec u + (-\vec u) = \vec 0
6 Closed under scalar multiply: c\vec u \in V
7 Distributive: c(\vec u + \vec v) = c\vec u + c\vec v
8 Distributive: (c + d)\vec u = c\vec u + d\vec u
9 Associative: c(d\vec u) = (cd)\vec u
10 Multiply identity: 1 \vec u = \vec u

Function Spaces

First let's properly define a function. If X and Y are sets and we have x \in X and y \in Y , we can form a set F that consists of ordered pairs (x_i, y_j) . If it is the case that (x_1, y_1) \in F \land (x_1, y_2) \in F \implies y_1 = y_2 , then F is a function and we write y = F(x) . This is called being "single valued", i.e., each input to the function is unambiguous. Note, F(x) is not a function, it is the value returned by the function: F is the function that defines the mapping from elements in X (the domain) to the elements in Y (the range).

One example of a vector space is the set of real-valued functions of a real variable, defined on the domain [a \le x \le b] . I.e., we're talking about \{f_1, f_2, \ldots, f_n\} where each f_i is a function. In other words, f_i : \mathbb{R} \rightarrow \mathbb{R} with the aforementioned restriction on the domain.

We've said that a vector is any element in a vector space, but usually we think of this in terms of, say, [x, y] \in \mathbb{R}^2 . We can also however, given the above, think of functions as vectors: the function-as-a-vector being the "vector" of ordered pairs that make up the mapping between domain and range. So in this case the set of functions is still like a set of vectors and with scalar multiplication and vector addition defined appropriately the rules for a vector space still hold!

What Isn't A Vector Space

It sounded a lot like everything was a vector space to me! So I did a quick bit of googling and found the paper "Things that are/aren't vector spaces". It does a really good job of showing how in maths we might build up a model of something and expect it to behave "normally" (i.e., obey the vector space axioms), when in fact it doesn't, in which case it becomes a lot harder to reason about. This is why vector spaces are nice... we can reason about them using logic and operations we are very familiar with and that feel intuitive.

Linear Combinations

A linear combination of vectors is one that only uses the addition and scalar multiplication of those vectors. So, given two vectors, \vec a and \vec b the following are linear combinations: 0\vec a + 0\vec b \\ 5\vec a - 2.5\vec b \\ -123\vec a + \pi\vec b This can be generalised to say that for any set of vectors \{\vec{v_1}, \vec{v_2}, \ldots, \vec{v_n}\} \text{ for } \vec{v_1}, \vec{v_2}, \ldots, \vec{v_n} \in \mathbb{R}^m that a linear combination of those vectors is c_1\vec{v_1} + c_2\vec{v_2} + \cdots + c_n\vec{v_n} \text{ for } c_i \in \mathbb{R} .
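Just to convince myself, here is a minimal numpy sketch of a linear combination (the vectors and coefficients are arbitrary made-up values):

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([-3.0, 0.5])

# The linear combination 5a - 2.5b is just element-wise scaling and adding.
combo = 5 * a - 2.5 * b
print(combo)  # [12.5   8.75]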

Spans

graph showing span of two vectors in 2D space The set of all linear combinations of a set of vectors \{\vec{v_1}, \vec{v_2}, \ldots, \vec{v_n}\} is called the span of those vectors: \mathrm{span}(\vec{v_1}, \vec{v_2}, \ldots, \vec{v_n}) = \{c_1\vec{v_1} + c_2\vec{v_2} + \cdots + c_n\vec{v_n} \;|\; c_i \in \mathbb{R}\} The span can also be referred to as the generating set of vectors for a subspace.

In the graph to the right there are two vectors \vec a and \vec b . The dotted blue lines show all the possible scalings of the vectors. The vectors span those lines. The green lines show two particular scalings of \vec a added to all the possible scalings of \vec b . One can use this to imagine that if we added all the possible scalings of \vec a to all the possible scalings of \vec b we would reach every coordinate in \mathbb{R}^2 . Thus, we can say that the set of vectors \{\vec a, \vec b\} spans \mathbb{R}^2 , or \mathrm{span}(\vec a, \vec b) = \mathbb{R}^2 .

From this we may be tempted to say that any set of two 2D vectors spans \mathbb{R}^2 , but we would be wrong. Take for example \vec a and \vec b = -\alpha\vec a . These two vectors are co-linear. Because they span the same line, no combination of them can ever move off that line. Therefore they do not span \mathbb{R}^2 .

Let's write the above out a little more thoroughly... \vec a = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} , \vec b = \begin{bmatrix} -\alpha a_1 \\ -\alpha a_2 \end{bmatrix} The span of these vectors is, by our above definition: \mathrm{span}(\vec a, \vec b) = \left\{ c_1 \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} + c_2 \begin{bmatrix} -\alpha a_1 \\ -\alpha a_2 \end{bmatrix} \mid c_i \in \mathbb{R} \right\} Which we can re-write as: \begin{align} \mathrm{span}(\vec a, \vec b) &= \left\{ c_1 \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} -\alpha c_2 \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \mid c_i \in \mathbb{R} \right\} \\ &= \left\{ (c_1 -\alpha c_2) \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \mid c_i \in \mathbb{R} \right\} \\ \end{align} ... which we can clearly see is just the set of all scaled vectors of \vec a . Plainly, any co-linear set of 2D vectors will not span \mathbb{R}^2 .

Linear Independence

We saw above that the set of vectors \{\vec a, -\alpha\vec a\} are co-linear and thus did not span \mathbb{R}^2 . The reason for this is that they are linearly dependent, which means that one or more vectors in the set are just linear combinations of other vectors in the set.

For example, the set of vectors \{[1, 0], [0, 1], [2, 3]\} spans \mathbb{R}^2 , but [2, 3] is redundant because we can remove it from the set of vectors and still have a set that spans \mathbb{R}^2 . In other words, it adds no new information or directionality to our set.

A set of vectors is said to be linearly dependent when there exist coefficients, not all zero, such that: c_1\vec{v_1} + \cdots + c_n\vec{v_n} = \vec 0, \ \exists\, c_i \ne 0 Conversely, a set of vectors is linearly independent when the only solution to c_1\vec{v_1} + \cdots + c_n\vec{v_n} = \vec 0 is c_i = 0 for all i . Sure, this is a definition, but why does this mean that the set is linearly dependent?

Picture of linearly independent vectors

Think of vectors in \mathbb{R}^2 . For any two vectors that are not co-linear, there is no combination of those two vectors that can get you back to the origin unless they are both scaled by zero. The image to the left is trying to demonstrate that. If we scale \vec a , no matter how small we make that scaling, we cannot find a scaling of \vec b that when added to \vec a will get us back to the origin. Thus we cannot find a linear combination c_a\vec a + c_b\vec b = \vec 0 where c_a and/or c_b does not equal zero.

This shows that neither vector is a linear combination of the other, because if it were, we could use a non-zero scaling of one, added to the other, to get us back to the origin.

We can also show that two non-zero vectors are linearly independent when their dot product is zero. This is covered in a later section.

Another way of thinking about LI is that, if the vectors form the columns of a matrix A , there is only one solution (the zero vector) to the equation A\vec x = \vec 0 .
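A hedged sympy sketch of that last point, using made-up vectors: put the vectors in as the columns of a matrix and look at the null space of that matrix. If the null space contains only the zero vector, the columns are linearly independent.

from sympy import Matrix

# Columns are the vectors being tested: [2, 3] = 2*[1, 0] + 3*[0, 1], so dependent.
A = Matrix([[1, 0, 2],
            [0, 1, 3]])
print(A.nullspace())  # non-empty => a non-zero solution to Ax = 0 exists

# The standard basis vectors of R^2 are linearly independent.
B = Matrix([[1, 0],
            [0, 1]])
print(B.nullspace())  # [] => only the trivial solution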

Linear Subspace

If V is a set of vectors in \mathbb{R}^n , then V is a subspace of \mathbb{R}^n if and only if:

  1. V contains the zero vector: \vec 0 \in V
  2. V is closed under scalar multiplication: \vec x \in V, \alpha \in \mathbb{R} \implies \alpha\vec x \in V
  3. V is closed under addition: \vec a \in V, \vec b \in V \implies \vec a + \vec b \in V

What these rules mean is that, using linear operations, you can't do anything to "break" out of the subspace using vectors from that subspace. Hence "closed".

Surprisingly, at least for me, the zero vector on its own is actually a subspace of \mathbb{R}^n because the set \{\vec 0\} obeys all of the above rules.

A very useful thing is that a span is always a valid subspace. Take, for example, V = \mathrm{span}(v_1, v_2, v_3) . We can show that it is a valid subspace by checking that it obeys the above 3 rules.

First, does it contain the zero vector? Recall that a span is the set of all linear combinations of the span vectors. So we know that the following linear combination is valid, and hence the span contains the zero vector. 0\vec v_1 + 0\vec v_2 + 0\vec v_3 = \vec 0 Second, is it closed under scalar multiplication? Take any member of the span, \vec x = a_1\vec v_1 + a_2\vec v_2 + a_3\vec v_3 . Scaling it gives \alpha\vec x = \alpha a_1\vec v_1 + \alpha a_2\vec v_2 + \alpha a_3\vec v_3 , which is just another linear combination of the same vectors and so, by the definition of the span, is also in the span!

Third, is it closed under vector addition? Well, yes, the same argument as made above would apply.

Interestingly, using the same method as above, we can see that even something trivial, like the \mathrm{span}([a_1, a_2]) , which is just a line, is a subspace of \mathbb{R}^2 .

It can also be shown that any subspace basis will have the same number of elements. We say that the dimension of a subspace is the number of elements in a basis for the subspace. So, for example, if we had a basis for our subspace consisting of 13 vectors, the dimension of the subspace would be lucky 13.

Basis Of A Subspace

Recall, V = \mathrm{span}(\vec v_1, \ldots, \vec v_n) V is a subspace of \mathbb{R}^n (where each vector \vec v_i \in \mathbb{R}^n ). We also know that a span is the set of all possible linear combinations of these vectors.

If we choose a subset of V , S , where all the vectors in S are linearly independent, and we can write any vector in V as a linear combination of vectors in S , then we say that S is a basis for V .

S acts as a "minimal" set of vectors required to span V . I.e., if you added any other vector to S it is superfluous, because it would be a linear combination of the vectors already in S . You don't add any new information.

Put more formally: Let V be a subspace of \mathbb{R}^n . A subset \{\vec v_1, \vec v_2, \ldots, \vec v_k\} is a basis for V iff the following are true:

  1. The vectors \vec v_1, \vec v_2, \ldots, \vec v_k span V , and
  2. They are linearly independent.

A subspace can have an infinite number of bases, but there are standard bases. For example, in \mathbb{R}^2 the standard basis is \{[1, 0], [0, 1]\} . The rule of invariance of dimension states that two bases for a subspace contain the same number of vectors. The number of elements in the basis for a subspace is called the subspace's dimension, denoted as \mathrm{dim}(W) , where W is the subspace.

Thus, we can say that every subspace W of \mathbb{R}^n has a basis and \mathrm{dim}(W) \le n .

A basis is a good thing because we can break down any element in the space that the basis defines into a linear combination of basic building blocks, i.e. the basis vectors.

To find a basis use the following method:

  1. Form a matrix where the jth column vector is the jth vector w_j .
  2. Reduce the matrix to row-echelon form.
  3. Take the columns of the original matrix that correspond to the pivot columns; these are the basis vectors.
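Here is a small sympy sketch of that recipe with some made-up vectors: rref() reports the pivot column indices, and picking those columns of the original matrix gives a basis (columnspace() is a built-in shortcut that does the same thing).

from sympy import Matrix

# Candidate spanning vectors as columns; the third is the sum of the first two.
W = Matrix([[1, 0, 1],
            [0, 1, 1],
            [1, 1, 2]])

rref_matrix, pivot_cols = W.rref()
basis = [W.col(j) for j in pivot_cols]  # pivot columns of the *original* matrix
print(pivot_cols)       # (0, 1)
print(basis)            # two LI vectors that span the same space
print(W.columnspace())  # sympy's built-in equivalent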

TODO discuss the canonical vectors in R2 and R3. [1,0] , [0,1] ... but we could also have [1,0] , [1,1] .

TODO unique coefficients. implied by the linear independence

TODO most important basis are orthogonal and orthonormal.

Orthonormal bases are good because we can find the coefficients in the new basis easily: \alpha_k = \vec w_k \cdot \vec x

The Dot Product

Definition

This is a good reference.

The dot product is one definition for a vector multiplication that results in a scalar. Given two equal-length vectors, the dot product looks like this: \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} \cdot \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} = a_1b_1 + a_2b_2 + \cdots + a_nb_n Written more succinctly we have: x \cdot y = \sum_1^n {x_i y_i} The dot product is important because it will keep popping up in many places. Some particular uses are to calculate the length of a vector and to show that two vectors are linearly independent.
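In numpy the same sum of element-wise products is np.dot (or the @ operator); the vectors below are arbitrary examples:

import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, -5, 6])

# 1*4 + 2*(-5) + 3*6 = 12
print(np.dot(x, y))  # 12
print(x @ y)         # same thing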

Properties

The operation has the following properties:

Commutative: \vec v \cdot \vec w = \vec w \cdot \vec v
Distributive: (\vec v + \vec w) \cdot \vec x = \vec v \cdot \vec x + \vec w \cdot \vec x
Associative (c\vec v) \cdot \vec w = c(\vec v \cdot \vec w)
Cauchy-Schwarz Inequality: \vec x \cdot \vec y \le |\vec x \cdot \vec y| \le \|\vec x\| \|\vec y\|
|\vec x \cdot \vec y| = \|\vec x\| \|\vec y\| \implies \vec x = c \vec y , where c is non-zero.
Vector Triangle Inequality \|\vec x + \vec y\| \le \|\vec x\| + \|\vec y\|

Angle Between Vectors

Dot product defines the angle between vectors

The dot product also defines the angle between vectors. Take two unit vectors \hat a and \hat b . We can calculate: \text{angle between}\ \hat a\ \text{and}\ \hat b = \theta = \theta_a - \theta_b We know that: \cos(\theta_a) = a_x \;\text{and}\; \sin(\theta_a) = a_y \cos(\theta_b) = b_x \;\text{and}\; \sin(\theta_b) = b_y And we know that: \cos(\theta) = \cos(\theta_a - \theta_b) = \cos(\theta_a)\cos(\theta_b) + \sin(\theta_a)\sin(\theta_b) Therefore, by substitution: \begin{align} \cos(\theta) &= a_xb_x + a_yb_y \\ &= a \cdot b \end{align} Thus, \cos(\theta) = a \cdot b describes the angle between unit vectors. If the vectors are not unit vectors then the angle is defined, in general, by: \cos(\theta) = \frac{1}{\|\vec a\|\|\vec b\|} \vec a \cdot \vec b
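A quick numpy check of that last formula, using two arbitrary vectors (np.arccos gives the angle in radians):

import numpy as np

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
theta = np.arccos(cos_theta)
print(np.degrees(theta))  # ~45 degrees, as expected for these two vectors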

Vector Length

\vec{a} \cdot \vec{a} = \|\vec{a}\|^2 \\ \therefore \|\vec{a}\| = \sqrt{\vec{a} \cdot \vec{a}}

Vector Transposes And The Dot Product

In linear algebra vectors are implicitly column vectors. I.e., when we talk about \vec{v} we mean: \vec{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} Thus, the transpose of \vec{v} is a row vector: \vec{v}^T = \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} We can then do vector multiplication of the two like so: \begin{align} \vec{v}^T\vec{v} &= \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \\ &= v_1\cdot v_1 + v_2\cdot v_2 + \cdots + v_n\cdot v_n \\ &= \vec{v}\cdot\vec{v} \end{align} So, in general \vec{v}^T\vec{w} = \vec{v}\cdot\vec{w} = \vec{w}^T\vec{v} .

Dot Product With Matrix Vector Product

If A is an mxn matrix and \vec{x} \in \mathbb{R}^n then A\vec{x} \in \mathbb{R}^m .

Then, if \vec{y} \in \mathbb{R}^m then (A\vec{x})\cdot\vec{y} is also well defined and: \begin{align} (A\vec{x})\cdot\vec{y} &= (A\vec{x})^T\vec{y} \blacktriangleleft\ \text{See section Vector Transposes And The Dot Product} \\ &= \vec{x}^TA^T\vec{y} \blacktriangleleft\ \text{See matrix transpose rule }\ (A\vec{x})^T = \vec{x}^TA^T\\ &= \vec{x}^T(A^T\vec{y})\\ &= \vec{x}\cdot(A^T\vec{y}) \end{align} Thus we can say (A\vec{x})\cdot\vec{y} = \vec{x}\cdot(A^T\vec{y}) .
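This identity is easy to check numerically; the matrix and vectors below are arbitrary examples:

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # 2x3, so x is in R^3 and y is in R^2
x = np.array([1, 0, -1])
y = np.array([2, 3])

lhs = np.dot(A @ x, y)        # (Ax) . y
rhs = np.dot(x, A.T @ y)      # x . (A^T y)
print(lhs, rhs)               # both -10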

Unit Length Vectors

A unit vector is a vector that has a length of 1. If \hat u is a vector (notice the special symbol used) then we define its length as: \mathrm{length}(\hat u) = \|\hat u\| = \sqrt{u_1^2 + u_2^2 + \cdots + u_n^2} Thus if a vector is a unit vector then \|\hat u\| = 1 .

Interestingly, we can also write \|\vec u\|^2 = \vec u \cdot \vec u .

To construct a unit vector where \sqrt{u_1^2 + u_2^2 + \cdots + u_n^2} = 1 , we need the sum u_1^2 + u_2^2 + \cdots + u_n^2 to also be 1! How do we achieve this if \|\vec u\| \ne 1 ? We do the following: \hat{u} = \frac{1}{\|\vec u\|}\vec u \ \text{, where} \ \|\vec u\| \ne 0
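A one-line numpy version of that normalisation (the example vector is arbitrary):

import numpy as np

u = np.array([3.0, 4.0])
u_hat = u / np.linalg.norm(u)   # scale by 1/||u||, assuming ||u|| != 0

print(u_hat)                  # [0.6 0.8]
print(np.linalg.norm(u_hat))  # 1.0 (up to floating point rounding)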

Orthogonal Vectors and Linear Independence

Image showing 3 perpendicular vectors
Figure 1

Orthogonal vectors are vectors that are perpendicular to each other. In a standard 2D or 3D graph, this means that they are at right angles to each other and we can visualise them as seen in Figure 1: \vec x is at 90 degrees to both \vec y and \vec z . \vec y is at 90 degrees to \vec x and \vec z , and \vec z is at 90 degrees to \vec x and \vec y .

In any set of non-zero vectors, [\vec{a_1}, ..., \vec{a_n}] , the vectors will be linearly independent if every vector in the set is orthogonal to every other vector in the set. Note, however, that whilst an orthogonal set is L.I., a L.I. set is not necessarily an orthogonal set. Take the following example: \vec a = \begin{bmatrix} 1 \\ 1 \end{bmatrix},\ \vec b = \begin{bmatrix} 1 \\ 2 \end{bmatrix} The vectors \vec a and \vec b are L.I. because neither can be represented as a scalar multiple of the other. They are not, however, orthogonal.

So, how do we tell if one vector is orthogonal to another? The answer is the dot product which is defined as follows. x \cdot y = \sum_1^n {x_i y_i}

Two vectors are orthogonal when their dot product is zero: x \cdot y = 0 \implies x and y are orthogonal. You can scale either or both vectors and their dot product will still be zero!

But why is this the case? Let's imagine any two arbitrary vectors, each on the circumference of a unit circle (so we know that they have the same length and are therefore a proper rotation of a vector around the center of the circle). This is shown in figure 2. From the figure, we know the following:

Picture showing a vector being rotated
Figure 2

\begin{align} x_1 &= r \cos{\theta_1}& y_1 &= r \sin{\theta_1} \\ x_2 &= r \cos({\theta_1 + \theta_n}) & y_2 &= r \sin({\theta_1 + \theta_n}) \end{align} The vector (x_2, y_2) is the vector (x_1, y_1) rotated by \theta_n degrees. We can use the following trig identities: \begin{align} \sin(a \pm b)& = \sin(a)\cos(b) \pm \cos(a)\sin(b) \\ \cos(a \pm b)& = \cos(a)\cos(b) \mp \sin(a)\sin(b) \end{align} Substituting these into the formulas above, we get the following. \begin{align} x_2 &= r\cos\theta_1\cos\theta_n - r\sin\theta_1\sin\theta_n \\ y_2 &= r\sin\theta_1\cos\theta_n + r\cos\theta_1\sin\theta_n \\ \end{align} Which means that... \begin{align} x_2 &= x_1\cos\theta_n - y_1\sin\theta_n \\ y_2 &= y_1\cos\theta_n + x_1\sin\theta_n \end{align} For a 90 degree rotation \theta_n = 90^\circ we know that \cos\theta_n = 0 and \sin\theta_n = 1 . Substituting these values into the above equations we can clearly see that... \begin{align} x_2 &= -y_1 \\ y_2 &= x_1 \end{align} Therefore, any vector in 2D space that is [a, b] will become [-b, a] and therefore the dot product becomes -ab + ab = 0 . And voilà, we now know why the dot product of two orthogonal 2D vectors is zero. I'm happy to take it on faith that this extrapolates into n dimensions :)
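A quick numeric sanity check of that result: take an arbitrary 2D vector, rotate it by 90 degrees (so [a, b] becomes [-b, a]) and take the dot product of the two.

import numpy as np

v = np.array([3.0, -7.0])
v_rot = np.array([-v[1], v[0]])   # [a, b] -> [-b, a], a 90 degree rotation

print(np.dot(v, v_rot))  # 0.0, i.e. -ab + ab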

Another way of looking at this is to consider the law of cosines: C^2 = A^2 + B^2 - 2AB\cos(\theta_c) :

Law of cosines

Here A and B represent the magnitudes of the vectors \vec x and \vec y , i.e. \|\vec x\| and \|\vec y\| . C represents the magnitude of the vector \vec x - \vec y , i.e. \|\vec x - \vec y\| . Thus, we have this:

Law of cosines

I.e., \|\vec x - \vec y\|^2 = \|\vec x\|^2 + \|\vec y\|^2 - 2\|\vec x\|\|\vec y\|\cos(\theta_c) .

When the angle between the vectors is 90 degrees, \cos(90) = 0 and so that part of the equation disappears to give us \|\vec x - \vec y\|^2 = \|\vec x\|^2 + \|\vec y\|^2 . Now... \|\vec x - \vec y\|^2 = \|\vec x\|^2 + \|\vec y\|^2 \\ \therefore \|\vec x\|^2 + \|\vec y\|^2 - \|\vec x - \vec y\|^2 = 0 Recall that \|\vec v\| = \sqrt{v_1^2 + v_2^2 + ... + v_n^2} and that therefore \|\vec v\|^2 = v_1^2 + v_2^2 + ... + v_n^2 .

Based on this, we can say that... \begin{align} \|\vec x - \vec y\|^2 &= (x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_n - y_n)^2 \\ &= x_1^2 - 2x_1y_1 + y_1^2 + x_2^2 - 2x_2y_2 + y_2^2 + ... + x_n^2 - 2x_ny_n + y_n^2 \end{align} Now, all of the x_i^2 and y_i^2 terms above will be cancelled out by the \|\vec x\|^2 and \|\vec y\|^2 terms, leaving us with... \|\vec x\|^2 + \|\vec y\|^2 - \|\vec x - \vec y\|^2 = 2x_1y_1 + 2x_2y_2 + ... + 2x_ny_n The right hand side is just twice the dot product. We saw above that when the vectors are at 90 degrees the left hand side is zero, so, dividing through by 2, the dot product must be zero too.

Matrices

Properties

Khan Academy has great articles. This is another good reference from MIT.

Scalar Multiplication

  1. Associative: (cd)A=c(dA)
  2. Distributive: c(A + B)=cA + cB and (c + d)A = cA + dA
  3. Identity: 1A = A
  4. Zero: 0A = O
  5. Closure: cA is a matrix of the same dimensions as A .

Note that, unlike matrix multiplication, scalar multiplication is commutative: cA = Ac .

Matrix Addition

  1. Commutative: A+B=B+A
  2. Associative: A+(B+C)=(A+B)+C
  3. Zero: For any matrix A , there is a unique matrix O such that A+O=A
  4. Inverse: \forall A,\ \exists! −A\ \mid\ A+(−A)=O
  5. Closure: A+B , is a matrix of the same dimensions as A and B .

Matrix Multiplication

  1. Associative: (AB)C=A(BC)
  2. Distributive: A(B+C)=AB+AC
  3. Identity: IA=A
  4. Zero: OA=O
  5. Inverse: Only defined when A is a square matrix; if the inverse A^{-1} exists then AA^{-1} = A^{-1}A = I . Note that not all square matrices are invertible.
  6. Dimension: If A is an m \times n matrix and B is an n \times k matrix then AB is an m \times k matrix.

Note that matrix multiplication is not commutative: \color{red}{AB \ne BA} !

Be careful: the equation AB = 0 does not necessarily mean that A = 0 or B = 0 and for a general matrix A , AB = AC \nRightarrow B = C , unless A is invertible.

Matrix Transpose

  1. Associative: (rA)^T = rA^T, r \in \mathbb{R}
  2. Distributive (into addition): (A + B)^T = A^T + B^T
  3. Distributive (into mult): (AB)^{T} = B^{T}A^{T}\ \color{red}{\blacktriangleleft \text{Note the reversal of terms}}
    Generalises to (ABC)^T = C^TB^TA^T
  4. Identity: I^T = I
  5. Double transpose: (A^T)^T = A
  6. Name?: (A^T)^{-1} = (A^{-1})^T
  7. Name?: (A\vec x)^T = \vec x^TA^T\ \color{red}{\blacktriangleleft \text{Note the reversal of terms}}
  8. Name?: \mathrm{rank}(A) = \mathrm{rank}(A^T)
  9. If columns of A are LI then columns of A^TA are LI. A^TA is square and also invertible.

Matrix Inverse

  1. Associative (self with inverse): AA^{−1} = A^{−1}A = I
  2. Distributive (into scalar mult): (rA)^{−1} = r^{−1}A^{−1},\ r \ne 0
  3. Distributive (into mult): (AB)^{-1} = B^{-1}A^{-1}\ \color{red}{\blacktriangleleft \text{Note the reversal of terms}}
  4. Identity: I^{-1} = I
  5. Name?: (A^{-1})^{-1} = A
  6. Name?: (A^T)^{-1} = (A^{-1})^T

Matrices, Simultaneous Linear Equations And Reduced Row Echelon Form

References:

The set of linear equations: a_{11}x_1 + a_{12}x_2 + ... + a_{1n}x_n = b_1 a_{21}x_1 + a_{22}x_2 + ... + a_{2n}x_n = b_2 ... a_{m1}x_1 + a_{m2}x_2 + ... + a_{mn}x_n = b_m Can be represented in matrix form: \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \\ \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \\ \end{bmatrix} Any vector \vec x , where A\vec x = \vec b , is a solution to the set of simultaneous linear equations. How can we solve this from the matrix? We create an augmented or partitioned matrix: \left[ \begin{array}{cccc|c} a_{11} & a_{12} & \cdots & a_{1n} & b_1\\ a_{21} & a_{22} & \cdots & a_{2n} & b_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} & b_m\\ \end{array} \right] Next this matrix must be converted to row echelon form. This is where each non-zero row has more zeros to the left of its first non-zero entry than the row above, with any all-zero rows at the bottom. For example, the following matrix is in row echelon form: \begin{bmatrix} 1 & 2 & 3 & 4 \\ 0 & 1 & 2 & 3 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ \end{bmatrix} But, this next one is not: \begin{bmatrix} 1 & 2 & 3 & 4 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 2 \\ 0 & 1 & 2 & 3 \\ \end{bmatrix} We go from a matrix to the equivalent in row-echelon form using elementary row operations:

  1. Row interchange: R_i \leftrightarrow R_j
  2. Row scaling: R_i \rightarrow kR_i
  3. Row addition: R_i \rightarrow R_i + kR_j

In row echelon form:

  1. All rows containing only zeros appear below rows with nonzero entries, and
  2. The first nonzero entry in any row appears in a column to the right of the first nonzero entry in any preceding row.

The first non zero entry in a row is called the pivot for that row.

And woop, woop, even more important than echelon form is reduced row-echelon form (rref). This is when:

  1. If a row has all zeros then it is below all rows with a non-zero entry.
  2. The leftmost non-zero entry in each row is 1 - it is called the pivot.
  3. The leftmost non-zero entry in each row is the only non-zero entry in its column.
  4. Each row has more zeros to the left of the pivot than the row above.

Or, to put it another way, when it is a matrix in row-echelon form where:

  1. All pivots are equal to 1, and
  2. Each pivot is the only non-zero in its column.

Some important terms are pivot position and pivot column. The position is the location of a leading 1, which meets the above criteria, and the column is a column that contains a pivot position.

So why do we give a damn about RREF if we have row-echelon form? Well, one reason is it lets us figure out if a set of linear equations has a solution and which variables are free and which are dependent. The reason for this is that the RREF of every matrix is unique (whereas merely row reducing a matrix can have many resulting forms).

The free variable is one that you have control over, i.e. you can choose and manipulate its value. The dependent one depends on the values of the other variables in the system or is constant.

Pivot columns represent variables that are dependent. Why? Recall that the column vector containing a pivot is all-zeros except for the pivot. So if element j in a row of a RREF matrix is a pivot, then in our system of equations x_j only appears in this one equation in our RREF matrix. Thus, its value must be determined as this is the only place it occurs. It must either be a constant or a combination of the free variables.

Because column vectors containing a pivot are all-zeros except for the pivot, the set of pivot columns must also be LI (see later). You can always represent the free columns as linear combinations of the pivot columns.

If we have an augmented matrix [A | b] :

  1. No Solution. If the last (augmented) column of [A|b] (i.e. the column which contains b ) contains a pivot entry.
  2. Restricted Solution Set. If the last row of the matrix is all zeros except the pivot column which is a function of b_1 \ldots b_m .
  3. Exactly One Solution. If the last (augmented) column of [A|b] (i.e. the column which contains b ) does not contain a pivot entry and all of the columns of the coefficient part (i.e. the columns of A ) do contain a pivot entry.
  4. Infinitely Many Solutions. If the last (augmented) column of [A|b] (i.e. the column which contains b ) does not contain a pivot entry and at least one of the columns of the coefficient part (i.e. the columns of A ) also does not contain a pivot entry.

Let's look at a quick example of a matrix in echelon-form: \left[ \begin{array}{ccc|c} -5 & -1 & -3 & 3 \\ 0 & 3 & 5 & 8 \\ 0 & 0 & 2 & -4 \end{array} \right]

We can get the solution to the set of simultaneous linear equations using back substitution. Directly from the above row-echelon form (useful because the last row has one pivot and no free variables) matrix we know that: -5x_1 - x_2 - 3x_3 = 3 \tag{1} 3x_2 + 5x_3 = 8 \tag{2} 2x_3 = -4 \tag{3} From (3) we know: x_3 = -2 \tag{4} Substitute (4) back into (2): 3x_2 - 10 = 8 \implies x_2 = 6 \tag{5} Substitute (5) and (4) back into (1): -5x_1 -6 +6 = 3 \implies x_1 = -\frac{3}{5} \tag{6}

We can use RREF to accomplish the same thing and find the solutions to the above. \left[ \begin{array}{ccc|c} -5 & -1 & -3 & 3 \\ 0 & 3 & 5 & 8 \\ 0 & 0 & 2 & -4 \end{array} \right] \rightarrow \left[ \begin{array}{ccc|c} 1 & \frac{1}{5} & \frac{3}{5} & -\frac{3}{5} \\ 0 & 3 & 5 & 8 \\ 0 & 0 & 1 & -2 \end{array} \right] \small{(r1 = -\frac{r1}{5}, r3 = \frac{r3}{2})} \rightarrow \left[ \begin{array}{ccc|c} 1 & \frac{1}{5} & \frac{3}{5} & -\frac{3}{5} \\ 0 & 3 & 0 & 18 \\ 0 & 0 & 1 & -2 \end{array} \right] \small{(r2 = r2 - 5r3)} \rightarrow \left[ \begin{array}{ccc|c} 1 & \frac{1}{5} & \frac{3}{5} & -\frac{3}{5} \\ 0 & 1 & 0 & 6 \\ 0 & 0 & 1 & -2 \end{array} \right] \small{(r2 = \frac{r2}{3})} \rightarrow \left[ \begin{array}{ccc|c} 1 & 0 & 0 & -\frac{3}{5} \\ 0 & 1 & 0 & 6 \\ 0 & 0 & 1 & -2 \end{array} \right] \small{(r1 = r1 - \frac{3r3}{5},\ r1 = r1 - \frac{r2}{5})} Which means we come to the same conclusion as we did when doing back substitution, but were able to do it using elementary row operations.

We can now talk about the general solution vector, \vec x , which in this case is: \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} -\frac{3}{5} \\ 6 \\ -2 \end{bmatrix} We can also talk about the solution set, which is the set of all vectors that result in a valid solution to our system of equations. In this case our solution set is trivial: it just has the above vector in it because the RREF had pivots in every column so there is exactly one solution.

Let's, therefore, look at something more interesting. Take the RREF matrix below: \left[ \begin{array}{cccc|c} 1 & 0 & 4 & 9 & 0 \\ 0 & 1 & 8 & 3 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{array} \right] Now we have two free variables! Now x_3 = a and x_4 = b where a and b can be any values we like. Now the general solution vector looks like this: \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} -4a - 9b \\ -8a -3b \\ a \\ b \end{bmatrix} How did we get here? We know from the RREF matrix that: x_1 + 4x_3 + 9x_4 = 0 x_2 + 8x_3 + 3x_4 = 0 We replace x_3 and x_4 with our free variables a and b and then solve for the pivot variables, thus getting the general solution vector.

The above can then be represented using the general solution vector: \begin{bmatrix} -4a - 9b \\ -8a -3b \\ a \\ b \end{bmatrix} = a\begin{bmatrix} -4 \\ -8 \\ 1 \\ 0 \end{bmatrix} + b\begin{bmatrix} -9 \\ -3 \\ 0 \\ 1 \end{bmatrix} The solution set thus becomes: \mathrm{span} \left( \begin{bmatrix} -4 \\ -8 \\ 1 \\ 0 \end{bmatrix} , \begin{bmatrix} -9 \\ -3 \\ 0 \\ 1 \end{bmatrix} \right)


We can find the reduced row echelon form of a matrix in python using the sympy package and the rref() function:

from sympy import Matrix, init_printing

init_printing(use_unicode=True)

system = Matrix(( (3, 6, -2, 11), (1, 2, 1, 32), (1, -1, 1, 1) ))
system
system.rref()[0]

Which outputs:

⎡1  0  0  -17/3⎤
⎢              ⎥
⎢0  1  0  31/3 ⎥
⎢              ⎥
⎣0  0  1   17  ⎦

To then apply this to our augmented matrix we could do something like the following. Let's say we have a set of 3 simultaneous equations with 3 unknowns: 3x + 6y - 2z = 11 \\ x + 2y + z = 32 \\ x - y + z = 1 We'd make an augmented matrix like so: \left[ \begin{array}{ccc|c} 3 & 6 & -2 & 11\\ 1 & 2 & 1 & 32\\ 1 & -1 & 1 & 1 \end{array} \right] To solve this using Python we do:

from sympy import Matrix, solve_linear_system, init_printing
from sympy.abc import x, y, z

init_printing(use_unicode=True)

system = Matrix(( (3, 6, -2, 11), (1, 2, 1, 32), (1, -1, 1, 1) ))
system

solve_linear_system(system, x, y, z)

Which will give the following output:

⎡3  6   -2  11⎤
⎢             ⎥
⎢1  2   1   32⎥
⎢             ⎥
⎣1  -1  1   1 ⎦
{x: -17/3, y: 31/3, z: 17}

Matrix / Vector Multiplication

Matrix/vector multiplication is pretty straightforward: \begin{bmatrix} a & c \\ b & d \\ \end{bmatrix} \begin{bmatrix} 2 \\ 4 \end{bmatrix} = \begin{bmatrix} 2a + 4c \\ 2b + 4d \end{bmatrix}

The concept of matrix multiplication, the type I learnt in school, is shown in the little animation below. For each row vector of the LHS matrix we take the dot product of that vector with each column vector from the RHS matrix to produce the result:

But having said that, I'd never really thought of it as shown below until watching some stuff on Khan Academy:

\begin{bmatrix} \color{red}a & \color{blue}c \\ \color{red}b & \color{blue}d \\ \end{bmatrix} \begin{bmatrix} 2 \\ 4 \end{bmatrix} = 2 \begin{bmatrix} \color{red}a \\ \color{red}b \end{bmatrix} + 4 \begin{bmatrix} \color{blue}c \\ \color{blue}d \end{bmatrix}

We can think of it as a linear combination of the column vectors of the matrix, where the coefficients are the elements of the vector: the first column vector is multiplied by the first element, the second column vector by the second element, and so on.

So, for any mxn matrix, A , and vector \vec v \in \mathbb{R}^n we can say that, A\vec v = \begin{bmatrix} \uparrow & \uparrow & & \uparrow\\ \vdots & \vdots & & \vdots\\ \vec{c_1} & \vec{c_2} & \cdots & \vec{c_n}\\ \vdots & \vdots & & \vdots\\ \downarrow & \downarrow & & \downarrow \end{bmatrix} \begin{bmatrix} {v_1} \\ {v_2} \\ \vdots \\ {v_n} \end{bmatrix} = v_1\vec{c_1} + v_2\vec{c_2} + \cdots + v_n\vec{c_n}
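The numpy sketch below (arbitrary matrix and vector) confirms that the two views agree: the row-by-row dot products and the linear combination of the columns give the same answer.

import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])
v = np.array([2, 4])

as_product = A @ v
as_column_combo = v[0] * A[:, 0] + v[1] * A[:, 1]

print(as_product)       # [10 22 34]
print(as_column_combo)  # identical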

Null Space Of A Matrix

Definition

For an m x n matrix A , the null space of A , N(A) , is defined as follows: N(A) = \{\vec x \in \mathbb{R}^n \mid A\vec x = \vec 0 \} In words, the null space of a matrix is the set of all solutions that result in the zero vector, i.e. the solution set to the above. We can get this set by using the normal reduction to RREF to solve the above ( [A\mid\vec 0] \longrightarrow [R \mid \vec 0] ). The solution space set to this is the null space.

The system of linear equations A\vec x = \vec 0 is called homogeneous. It is always consistent, i.e., has a solution, because \vec x = \vec 0 is always a solution (and is called the trivial solution).

The null space of a matrix is a valid subspace of \mathbb{R}^n because it satisfies the following conditions, as we saw in the section on subspaces.

  1. N(A) contains the zero vector: \vec 0 \in N(A)
  2. N(A) is closed under scalar multiplication: \vec x \in N(A), \alpha \in \mathbb{R} \implies \alpha\vec x \in N(A)
  3. N(A) is closed under addition: \vec a \in N(A), \vec b \in N(A) \implies \vec a + \vec b \in N(A)

It can also be shown that N(A) = N(\mathrm{RREF}(A)) , where \mathrm{RREF} is the reduced row echelon form.

Relation To L.I. Of Column Vectors

The nullspace of a matrix can be related to the linear independence of its column vectors. Using the above definition of nullspace, we can write: A\vec x = \begin{bmatrix} \uparrow & \uparrow & & \uparrow\\ \vdots & \vdots & & \vdots\\ \vec{a_1} & \vec{a_2} & \cdots & \vec{a_n}\\ \vdots & \vdots & & \vdots\\ \downarrow & \downarrow & & \downarrow \end{bmatrix} \begin{bmatrix} {x_1} \\ {x_2} \\ \vdots \\ {x_n} \end{bmatrix} = \begin{bmatrix} 0_1 \\ 0_2 \\ \vdots \\ 0_m \end{bmatrix} Recalling how we can write out vector/scalar multiplication, we get, x_1\vec{a_1} + x_2\vec{a_2} + \cdots + x_n\vec{a_n} = \vec 0 Recall from the section on linear independence that the above set of vectors is said to be linearly independent when: x_1\vec{a_1} + x_2\vec{a_2} + \cdots + x_n\vec{a_n} = \vec 0,\ \forall \{x_i \mid x_i = 0 \} Given that our nullspace is the set of all vectors that satisfy the above, by definition, then we can say that the column vectors of a matrix A are LI if and only if N(A) = \{\vec 0\} .

Relation To Determinant

If the determinant of a (square) matrix is zero then the null space will be non-trivial.

Calculating The Null Space

To find the null space of a matrix A , augment it with the zero vector and convert it to RREF. Then the solution set is the null space. So, if every column contains a pivot, the null space only contains the zero vector because there is only one solution (which therefore must be \vec 0 ).

Example #1

Let's look at a trivial example: \begin{align} \text{Let } A &= \begin{bmatrix} 4 & 1 & 1\\ 8 & 3 & 0\\ 0 & 0 & 3 \end{bmatrix} \end{align} Then the null space is defined as the solution set to A\vec x=\vec0 , i.e... \begin{bmatrix} 4 & 1 & 1 \\ 8 & 3 & 0 \\ 0 & 0 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2\\ x_3 \end{bmatrix} = \begin{bmatrix} 0\\ 0\\ 0 \end{bmatrix} We find the solution set using the normal augmented matrix method: \left[\begin{array}{ccc|c} 4 & 1 & 1 & 0\\ 8 & 3 & 0 & 0\\ 0 & 0 & 3 & 0 \end{array}\right] \mapsto \left[\begin{array}{ccc|c} 4 & 1 & 1 & 0\\ 0 & 1 & -2 & 0\\ 0 & 0 & 3 & 0 \end{array}\right] \mapsto \left[\begin{array}{ccc|c} 4 & 1 & 1 & 0\\ 0 & 1 & -2 & 0\\ 0 & 0 & 1 & 0 \end{array}\right] \\ \mapsto \left[\begin{array}{ccc|c} 4 & 1 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 \end{array}\right] \mapsto \left[\begin{array}{ccc|c} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 \end{array}\right] There are no free variables, so therefore the solution set can contain only the zero vector because we have three constraints x_1 = 0, x_2 = 0 and x_3 = 0 . The only values for the x_i 's that can satisfy these constraints is zero, hence the null space can only contain the zero vector.

Example #2

Let's consider another trivial example. \begin{align} \text{Let } A &= \begin{bmatrix} 2 & 3 & 0\\ 6 & 12 & 0\\ 8 & 15 & 0 \end{bmatrix} \end{align} It's pretty contrived because I constructed it by applying row operations to the identity matrix! Null space is defined as solutions to: \begin{bmatrix} 2 & 3 & 0\\ 6 & 12 & 0\\ 8 & 15 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2\\ x_3 \end{bmatrix} = \begin{bmatrix} 0\\ 0\\ 0 \end{bmatrix} We find the solution set using the normal augmented matrix method: \left[\begin{array}{ccc|c} 2 & 3 & 0 & 0\\ 6 & 12 & 0 & 0\\ 8 & 15 & 0 & 0 \end{array}\right] \mapsto \left[\begin{array}{ccc|c} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 \end{array}\right] Now we can see that x_1 = 0 and x_2 = 0 . The variable x_3 , however, is free. It can take any value. Let's call that value x_3 = a . The solution vector is therefore: \begin{bmatrix}x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix}0 \\ 0 \\ a \end{bmatrix} The general solution vectors is therefore: a \begin{bmatrix}0 \\ 0 \\ 1 \end{bmatrix} So we can say the null space is: \mathrm{N}(A) = \mathrm{span}\left(\begin{bmatrix}0 \\ 0 \\ 1 \end{bmatrix}\right)

Example #3

Let's try one more example. This one is right out of Khan Academy rather than my own made-up rubbish :) \text{Let } A = \begin{bmatrix} 2 & -1 & -3 \\ -4 & 2 & 6 \end{bmatrix} We calculate the null space by setting up an augmented matrix to solve for A\vec x = \vec 0 : \left[\begin{array}{ccc|c} 2 & -1 & -3 & 0\\ -4 & 2 & 6 & 0 \end{array}\right] \mapsto \left[\begin{array}{ccc|c} 1 & -1/2 & -3/2 & 0 \\ 0 & 0 & 0 & 0 \end{array}\right] We thus know that: x_1 = \frac{1}{2}x_2 + \frac{3}{2} x_3 And that x_2 and x_3 are free variables. This means that the general solution is: \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = x_2 \begin{bmatrix}\frac{1}{2}\\ 1\\ 0 \end{bmatrix} + x_3 \begin{bmatrix}\frac{3}{2}\\ 0\\ 1\end{bmatrix} And therefore we can define the null space as: \mathrm{N}(A) = \mathrm{span}\left(\begin{bmatrix}\frac{1}{2}\\ 1\\ 0 \end{bmatrix}, \begin{bmatrix}\frac{3}{2}\\ 0\\ 1\end{bmatrix}\right)
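We can sanity check this example with sympy's nullspace() method, which returns a basis for the null space in the same form (pivot variables expressed in terms of the free ones):

from sympy import Matrix

A = Matrix([[2, -1, -3],
            [-4, 2, 6]])

for vec in A.nullspace():
    print(vec.T)   # [1/2, 1, 0] and [3/2, 0, 1]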

An Intuitive Meaning

This StackOverflow thread gives some really good intuitive meaning to the null space. I've quoted some of the answers verbatim here:

It's good to think of the matrix as a linear transformation; if you let h(\vec v)=A\vec v , then the null-space is ... the set of all vectors that are sent to the zero vector by h . Think of this as the set of vectors that lose their identity as h is applied to them.

Let's suppose that the matrix A represents a physical system. As an example, let's assume our system is a rocket, and A is a matrix representing the directions we can go based on our thrusters. So what do the null space and the column space represent?

Well let's suppose we have a direction that we're interested in. Is it in our column space? If so, then we can move in that direction. The column space is the set of directions that we can achieve based on our thrusters. Let's suppose that we have three thrusters equally spaced around our rocket. If they're all perfectly functional then we can move in any direction. In this case our column space is the entire range. But what happens when a thruster breaks? Now we've only got two thrusters. Our linear system will have changed (the matrix A will be different), and our column space will be reduced.

What's the null space? The null space is the set of thruster instructions that completely waste fuel. They're the set of instructions where our thrusters will thrust, but the direction will not be changed at all.

Another example: Perhaps A can represent a rate of return on investments. The range are all the rates of return that are achievable. The null space are all the investments that can be made that wouldn't change the rate of return at all.

Nullity Of A Matrix

The dimension of a subspace is the number of elements in a basis for that subspace. The dimension of the null space is called the "nullity":

In the null space you will always have as many vectors as you have free variables in the RREF of the matrix. So the nullity is the number of free variables in the RREF of the matrix with which this null space is associated: \begin{align} \mathrm{nullity}(A) &= \mathrm{dim}(\mathrm{N}(A)) \\ &= \text{#non-pivot cols in}\ \mathrm{RREF}(A) \\ &= n - \mathrm{rank}(A) \end{align} Thus the null space has dimension, or nullity, n - \mathrm{rank}(A) . This equates to the number of free variables.

Column Space Of A Matrix

The column space of a matrix is the set of all linear combinations of its column vectors, which is the span of the column vectors: \begin{align} \mathrm{C}(A) &= \mathrm{span}(\text{the column vectors of}\ A) \\ &= \mathrm{span}(\text{the LI column vectors of}\ A) \end{align} Note that the column vectors of A might not be LI. The set can be reduced so that only the LI vectors are used, which means the above still holds, and then we say that the LI set is a basis for \mathrm{C}(A) . The column space of A is a valid subspace.

For the system of equations A\vec x = \vec b , \mathrm{C}(A) is the set of all values that A\vec x can be. We can look at this this way: A\vec x = \begin{bmatrix}a & b \\ c & d\end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} ax + by \\ cx + dy\end{bmatrix} = x \begin{bmatrix}a \\ c\end{bmatrix} + y \begin{bmatrix}b \\ d\end{bmatrix} = \vec b The solution, \vec b , can be seen to be a linear combination of the column vectors of A . Thus, \vec b \in \mathrm{C}(A) .

Therefore, if \vec{b_1} \notin \mathrm{C}(A) , then we know that A\vec x = \vec{b_1} has no solution, and conversely, if \vec{b_1} \in \mathrm{C}(A) , then we know that at least one solution exists.

The column vectors of A may or may not be linearly independent. When reduced to a LI set, we have a basis for the column space of the matrix.

We saw the following in the section on null spaces: it can be shown that the column vectors of a matrix A are LI if and only if N(A) = \{\vec 0\} . So we can figure out if we have a basis by looking at A 's null space. Recall also that \mathrm{N}(A) = \mathrm{N}(\mathrm{RREF}(A)) .

How do we figure out the set of column vectors that are LI? To do this, put the matrix in RREF and select the columns that have pivot entries (the pivot columns). The pivot column vectors are the basis vectors for \mathrm{C}(A) :) The reason is that you can always write the columns with the free variables as linear combinations of the pivot columns.


So to find a basis for the column space of A , pick the columns from A that correspond to (have the same position as) the pivot columns in \mathrm{RREF}(A) .

A really interesting property of the column space is the column space criterion, which states that a linear system A\vec x = \vec b has a solution iff \vec b is in the column space of A.

Rank Of A Matrix

We can now define what the rank of a matrix is: \begin{align} \mathrm{rank}(A) &= \mathrm{dim}(\mathrm{C}(A)) \\ &= \text{#pivot columns in}\ \mathrm{RREF}(A) \end{align} I.e., the rank of a matrix is the number of pivots. This also means that the rank is the number of vectors in a basis for the column space of A . You can get this by counting the pivot columns in \mathrm{RREF}(A) .

The rank equation is \mathrm{rank}(A) + \mathrm{nullity}(A) = n , where A is an m x n matrix.

If A is reduced to row-echelon form H , then:

  1. \mathrm{nullity}(A) is the number of free variables in solution space of A\vec x = \vec 0 , which equals the number of pivot-free columns in H .
  2. \mathrm{rank}(A) is the number of pivots in H .
  3. \mathrm{rank}(A) + \mathrm{nullity}(A) = n , equals the number of columns of A .
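A minimal sympy check of the rank equation, reusing the matrix from Example #3 above (2x3, so n = 3):

from sympy import Matrix

A = Matrix([[2, -1, -3],
            [-4, 2, 6]])

rank = A.rank()
nullity = len(A.nullspace())   # dimension of the null space

print(rank, nullity, rank + nullity)  # 1 2 3, and 3 == number of columns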
TODO:
https://math.stackexchange.com/questions/21100/importance-of-matrix-rank

Basis Of The Column Space VS Basis Of The Null Space

Really good answer by John-Paul Meyer on the Khan Academy website:

... a "basis of a column space" is different than a "basis of the null space", for the same matrix....

A basis is a set of vectors related to a particular mathematical "space" (specifically, to what is known as a vector space). A basis must:

  1. be linearly independent and
  2. span the space.

The "space" could be, for example: \mathbb{R}^2 , \mathbb{R}^3 , \mathbb{R}^4 , ... , the column space of a matrix, the null space of a matrix, a subspace of any of these, and so on...

Orthogonal Matrices

So, onto orthogonal matrices. A matrix is orthogonal if AA^T = A^TA = I . If we take a general matrix and multiply it by its transpose we get...

AA^T = \begin{pmatrix} a & c \\ b & d \\ \end{pmatrix} \begin{pmatrix} a & b \\ c & d \\ \end{pmatrix} = \begin{pmatrix} aa + cc & ab + cd \\ ab + cd & bb + dd \\ \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ \end{pmatrix}

The pattern ab + cd looks pretty familiar right?! It looks a lot like x_1x_2 + y_1y_2 , our formula for the dot product of two vectors. So, when we say a=x_1 , b=x_2 , c=y_1 and d=y_2 , we will get the following: a matrix of two row vectors \vec v_1 = [x_1, y_1] , \vec v_2 = [x_2, y_2] . A = \begin{pmatrix} \vec v_1 \\ \vec v_2 \\ \end{pmatrix} = \begin{pmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \end{pmatrix} \therefore AA^T = \begin{pmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \end{pmatrix} \begin{pmatrix} x_1 & x_2 \\ y_1 & y_2 \\ \end{pmatrix} = \begin{pmatrix} x_1^2 + y_1^2 & x_1x_2 + y_1y_2 \\ x_1x_2 + y_1y_2 & x_2^2 + y_2^2 \\ \end{pmatrix} If our vectors \vec v_1 and \vec v_2 are orthogonal, then the component x_1x_2 + y_1y_2 must be zero. This would give us the first part of the identify matrix pattern we're looking for.

The other part of the identity matrix would imply that we would have to have x_1^2 + y_1^2 = 1 and x_2^2 + y_2^2 = 1 ... which are just the formulas for the square of the length of a vector. Therefore, if our two vectors are normalised (i.e., have a length of 1), we have our identity matrix.

Does the same hold true for A^TA ? It doesn't obviously if we use our original matrix A!... A^TA = \begin{pmatrix} x_1 & x_2 \\ y_1 & y_2 \\ \end{pmatrix} \begin{pmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \end{pmatrix} = \begin{pmatrix} x_1^2 + x_2^2 & x_1y_1 + x_2y_2 \\ x_1y_1 + x_2y_2 & y_1^2 + y_2^2 \\ \end{pmatrix} Oops, we can see that we didn't obviously get the identity matrix!! But, perhaps we can see why. If A was a matrix of row vectors then A^T is a matrix of column vectors. So for AA^T we were multiplying a matrix of row vectors with a matrix of column vectors, which would, in part, give us the dot products as we saw. So if we want to do A^TA it would follow that A now has to be a matrix of column vectors, because then we get back to our original result ( A^T would become a matrix of row vectors): A^TA = \begin{pmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \end{pmatrix} \begin{pmatrix} x_1 & x_2 \\ y_1 & y_2 \\ \end{pmatrix} = \begin{pmatrix} x_1^2 + y_1^2 & x_1x_2 + y_1y_2 \\ x_1x_2 + y_1y_2 & x_2^2 + y_2^2 \\ \end{pmatrix}

So we can say that if we have a matrix whose rows are orthonormal vectors and whose columns are also orthonormal vectors, then we have an orthogonal matrix!

Okay, that's great n' all, but why should we care? Why are orthogonal matrices useful? It turns out that orthogonal matrices preserve angles and lengths of vectors. This can be useful in graphics to rotate vectors but keep the shape they construct, or in numerical analysis because they do not amplify errors.
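A hedged numpy sketch of that: a 2D rotation matrix is orthogonal, and applying it to a vector leaves the vector's length unchanged (the angle and test vector are arbitrary).

import numpy as np

theta = np.radians(30)
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation matrix

print(np.allclose(Q @ Q.T, np.eye(2)))  # True: Q Q^T = I
print(np.allclose(Q.T @ Q, np.eye(2)))  # True: Q^T Q = I

v = np.array([3.0, 4.0])
print(np.linalg.norm(v), np.linalg.norm(Q @ v))  # both 5.0: length preserved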

Determinants

The Fundamentals

If B is defined as follows... B = \begin{bmatrix} a & b \\ c & d \\ \end{bmatrix} Then we write the determinant of B as... \mathrm{det}(B) = |B| = \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc

A matrix is invertible iff |B| \ne 0 . We can define the inverse as follows using the determinant: B^{-1} = \frac{1}{|B|} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} It is defined as long as |B| \ne 0 .

If A is a 3x3 matrix defined as follows, where the elements a_{ij} represents the element at row i and column j : A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ \end{bmatrix} We can define the determinant as: \mathrm{det}(A) = a_{11} \begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} - a_{12} \begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} + a_{13} \begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix} Same rule applies: If the determinant is not zero then the matrix will be invertible.

We've used the first row ( a_{11}, a_{12}, a_{13} ) but we could have used any row, or any column: choose the one with the most zeros as this will make the calculation easier/faster.

Select the row or column with the most zeros and go through it. Use each element a_{ij} multiplied by the determinant of the 2x2 matrix you get if you take out the ith row and jth column of the matrix.

This generalises to an nxn matrix as follows: \stackrel{A}{\small{nxn}} \ = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} Define A_{ij} to be the (n-1) x (n-1) matrix you get when you ignore the ith row and jth column of the nxn matrix A.

Then we can define the determinant recursively as follows: \textrm{det}(A) = a_{11}|A_{11}| - a_{12}|A_{12}| + a_{13}|A_{13}| - \cdots + a_{1n}|A_{1n}| For each |A_{ij}| , apply the formula recursively until a 2x2 matrix is reached, at which point you can use the 2x2 determinant formula.

There is a pattern to the "+"'s and "-"'s: \textrm{sign}(i, j) = (-1)^{(i+j)}

A quick example. A is defined as follows: A = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 1 & 0 & 2 & 0 \\ 0 & 1 & 2 & 3 \\ 2 & 3 & 0 & 0 \end{bmatrix} We find the determinant as follows, where as we did previously, we define A_{ij} to be the (n-1) x (n-1) matrix you get when you ignore the ith row and jth column of the nxn matrix A. \textrm{det}(A) = -2|A_{41}| + 3|A_{42}| See how the 4th row was used as it contains 2 zeros, meaning that we only have to actually do any work for two of the terms. Note also how we got the signs for the various components using \textrm{sign}(i, j) = (-1)^{(i+j)} . Another way to visualise this is shown below: \begin{bmatrix} + & - & + & - \\ - & + & - & + \\ + & - & + & - \\ - & + & - & + \end{bmatrix} This now needs to be applied recursively, so we figure out the determinants of A_{41} and A_{42} in the same way.

Let B = A_{41} : |B| = \begin{vmatrix} 2 & 3 & 4 \\ 0 & 2 & 0 \\ 1 & 2 & 3 \end{vmatrix} = 2|B_{22}| = 2\begin{vmatrix} 2 & 4 \\ 1 & 3 \end{vmatrix} = 2(2*3 - 4*1) = 4 Note again how we picked row 2 because it only had 1 term which was non-zero and hence made our life easier.

Let C = A_{42} \begin{aligned} |C| &= \begin{vmatrix} 1 & 3 & 4 \\ 1 & 2 & 0 \\ 0 & 2 & 3 \end{vmatrix} \\ &= -1|C_{21}| + 2|C_{22}| = -1\begin{vmatrix} 3 & 4 \\ 2 & 3 \end{vmatrix} + 2\begin{vmatrix} 1 & 4 \\ 0 & 3 \end{vmatrix} \\ &= -1(9-8) +2(3-0) \\ &= 5 \end{aligned} We picked row 2 again, but we could have just as easily used columns 1 or 3 to the same effect.

Now we can write: \begin{aligned} \textrm{det}(A) &= -2|A_{41}| + 3|A_{42}|\\ &= -2|B| + 3|C|\\ &= -2(4) + 3(5) \\ &= 7 \end{aligned} We can see this answer verified in the section "Do It In Python"
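Here is a short, hedged Python sketch of the recursive cofactor expansion described above. For simplicity it always expands along the first row rather than hunting for the row with the most zeros; it reproduces the answer of 7 for the matrix used in this example.

import numpy as np

def det(M):
    # Recursive cofactor expansion along the first row.
    n = M.shape[0]
    if n == 1:
        return M[0, 0]
    total = 0
    for j in range(n):
        minor = np.delete(np.delete(M, 0, axis=0), j, axis=1)  # drop row 0, column j
        total += (-1) ** j * M[0, j] * det(minor)
    return total

A = np.array([[1, 2, 3, 4],
              [1, 0, 2, 0],
              [0, 1, 2, 3],
              [2, 3, 0, 0]])
print(det(A))  # 7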

Do It In Python

>>> import numpy as np
>>> A = np.mat("1 2 3 4;1 0 2 0; 0 1 2 3; 2 3 0 0")
>>> np.linalg.det(A)
7.000000000000004

See It In HTML

The following is an online matrix determinant calculator app, written in Javascript, to explain the calculation of the determinant. Define a matrix in the text box using a Matlab-like syntax. For example, "1 2 3; 3 4 5; 5 6 7" defines the 3x3 matrix: \begin{bmatrix} 1 & 2 & 3 \\ 3 & 4 & 5 \\ 5 & 6 & 7 \end{bmatrix} Use it to create matrices to see, step by step, with explanations, how the determinant is calculated. Just replace the text in the message box below and hit the button to see how you can solve for the determinant of your matrix...

Rule Of Sarrus

For a 3x3 matrix defined as follows: A = \begin{bmatrix} a & b & c\\ d & e & f\\ g & h & i \end{bmatrix} The determinant is defined as: |A| = aei + bfg + cdh - afh - bdi - ceg Notice how it is the diagonals (wrapped) summed going from left to right, and the diagonals (wrapped) subtracted going from right to left.

Effect When Row Multiplied By Scalar

Multiplying a row in a matrix by a scalar multiplies the determinant by that scalar. This is shown anecdotally below. \begin{align} \begin{vmatrix} a & b \\ kc & kd \end{vmatrix} &= akd - bkc\\ &=k(ad - bc) \\ &= k \begin{vmatrix} a & b \\ c & d \end{vmatrix} \end{align}

Effect When Matrix Multiplied By Scalar

Multiplying an entire nxn matrix by a scalar k scales the determinant by k^n . Shown anecdotally below for the 2x2 case: \begin{align} \begin{vmatrix} ka & kb \\ kc & kd \end{vmatrix} &= kakd - kbkc \\ &= k^2(ad - bc) \\ &= k^2\begin{vmatrix} a & b \\ c & d \end{vmatrix} \end{align}

Row Op: Effect Of Adding A Multiple Of One Row To Another

Say that A is defined as follows: A = \begin{bmatrix} a & b \\ c & d \end{bmatrix} Now lets say we apply the row operation Row1 = R1 + kR2 to get the matrix: B = \begin{bmatrix} a + kc & b + kd \\ c & d \end{bmatrix} Now... \begin{align} |B| &= \begin{vmatrix} a + kc & b + kd \\ c & d \end{vmatrix} \\ &= (a + kc)d - (b + kd)c \\ &= ad - bc \\ &= |A| \end{align} I.e. adding a multiple of one row to another does not change the determinant, as shown anecdotally above! Works the same if we do Row1 = R1 - kR2.

Row Op: Effect Of Swapping Rows

Swapping A's rows and taking the determinant gives us: \begin{align} \begin{vmatrix} c & d \\ a & b \end{vmatrix} &= bc - ad \\ &= -(ad - bc) \\ &= -|A| \end{align} So, we can see that for a 2x2 matrix, swapping rows changes the sign of the determinant. This generalises to nxn matrices.

Effect Of Adding A Row From One Matrix To Another

Let's say we have two matrices: A = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \ \text{and} \ B = \begin{bmatrix} a & b \\ e & f \end{bmatrix} Create a new matrix by adding the second row of A to the second row of B: Z = \begin{bmatrix} a & b \\ c + e & d + f \end{bmatrix} We will find out that the determinant of Z is the sum of the determinants of A and B: \begin{align} |Z| &= a(d+f) - b(c+e) \\ &= ad - bc + af - be \\ &= |A| + |B| \end{align} Generalises to nxn matrices.

When A Matrix Has Duplicate Rows

We have seen that when we swap two rows that the determinant is negated. But, if we swap two identical rows the matrix determinant should not change! The only way this can be true is if the determinant is zero.

Thus, somewhat anecdotally, we can say that if a matrix has a duplicate row then its determinant must be zero! The same is true if one row is a simple scalar multiple of another.

We have also seen that the inverse of a matrix is only available when its determinant is non-zero. Therefore we now know that matrices with duplicate rows are not invertible. The same still applies if one row is a scalar multiple of another.

Upper Triangular Determinant

An upper triangular matrix is one where everything below the diagonal is zero. For upper triangular matrices the determinant is the multiplication of the diagonal terms: \begin{vmatrix} a & b \\ 0 & d \end{vmatrix} = ad And: \begin{vmatrix} a & b & c \\ 0 & d & e \\ 0 & 0 & f \end{vmatrix} = adf This generalises to nxn matrices: \textrm{det}(A_{\text{triangular}}) = \prod_{i=1}^{n}a_{ii} The same applies to lower triangular matrices.

This gives us a way of computing determinants more efficiently: reduce the matrix to upper (or lower) triangular form and multiply the diagonal. The Khan lectures give this example of a matrix reduction (remember that we saw that adding multiples of one row to another does not change the determinant): \begin{vmatrix} 1 & 2 & 2 & 1 \\ 1 & 2 & 4 & 2 \\ 2 & 7 & 5 & 2 \\ -1 & 4 & -6 & 3 \end{vmatrix} ~ -\begin{vmatrix} 1 & 2 & 2 & 1 \\ 0 & 3 & 1 & 0 \\ 0 & 0 & 2 & 1 \\ 0 & 0 & 0 & 7 \end{vmatrix} = -(1 \times 3 \times 2 \times 7) = -42 Note the minus because a row was swapped during the reduction!
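We can at least confirm the end result of that example numerically (the reduction steps themselves aren't shown here, just the final determinant):

```python
import numpy as np

A = np.array([[ 1, 2,  2, 1],
              [ 1, 2,  4, 2],
              [ 2, 7,  5, 2],
              [-1, 4, -6, 3]], dtype=float)

print(np.linalg.det(A))   # -42.0 (up to floating point error)
```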

The Determinant And The Area Of A Parallelogram

This is one of the better YouTube tutorials I've found that explains the concepts, and not just the mechanics, behind determinants, by 3Blue1Brown (awesome!).

Now, onto more notes from Khan Academy, which also gives an absolutely awesome explanation! Also, this Mathematics StackExchange post gives loads of different ways of proving this.

For two vectors, each made a column vector of a matrix A, the determinant of A gives the area of the parallelogram formed by the two vectors:



Put a little more mathsy-like, if you have two vectors, call them, \vec{v_1} and \vec{v_2} , then the area of the parallelogram formed by the vectors, as shown above, is given by: \textrm{Area} = \begin{vmatrix} | & | \\ \vec{v_1} & \vec{v_2} \\ | & | \end{vmatrix}

To aid in the following explanation you might need to refer to the section on projections.

The area of a parallelogram is given by the base times the height. In the above we can see that the base of the parallelogram is just the length b = \|\vec{v_1}\| . The height of the parallelogram, we can see is h = \|\vec{v_2} - \textrm{Proj}_{\vec{v_1}}(\vec{v_2})\| . So... \begin{align} a &= b \times h \\ &= \|\vec{v_1}\| \|\vec{v_2} - \textrm{Proj}_{\vec{v_1}}(\vec{v_2})\| \\ &= \sqrt{\vec{v_1} \cdot \vec{v_1}} \sqrt{(\vec{v_2} - \textrm{Proj}_{\vec{v_1}}(\vec{v_2})) \cdot (\vec{v_2} - \textrm{Proj}_{\vec{v_1}}(\vec{v_2}))}\ \blacktriangleleft \text{ because }\ \|\vec{a}\|^2 = \vec{a} \cdot \vec{a} \end{align} Get rid of the square roots by squaring both sizes: \begin{align} a^2 &= \bigl( \vec{v_1} \cdot \vec{v_1} \bigr) \bigl( ( \vec{v_2} - \textrm{Proj}_{\vec{v_1}}(\vec{v_2}) ) \cdot ( \vec{v_2} - \textrm{Proj}_{\vec{v_1}}(\vec{v_2}) ) \bigr)\\ &= \Bigl( \vec{v_1} \cdot \vec{v_1} \Bigr) \Bigl( \vec{v_2} \cdot \vec{v_2} -2\vec{v_2}\cdot\textrm{Proj}_{\vec{v_1}}(\vec{v_2}) + \bigl(\textrm{Proj}_{\vec{v_1}}(\vec{v_2})\bigr)^2 \Bigr)\ \blacktriangleleft \text{ dot product is distributive } \\ &= \left( \vec{v_1} \cdot \vec{v_1} \right) \left( \vec{v_2}\cdot\vec{v_2} - 2\vec{v_2}\cdot\left(\frac{\vec{v_1}\cdot\vec{v_2}}{\|\vec{v_1}\|^2} \vec{v_1}\right) + \left( \left( \frac{\vec{v_1}\cdot\vec{v_2}}{\|\vec{v_1}\|^2} \vec{v_1} \right) \cdot \left( \frac{\vec{v_1}\cdot\vec{v_2}}{\|\vec{v_1}\|^2} \vec{v_1} \right) \right) \right) \blacktriangleleft\ \text{ because }\ \textrm{Proj}_L(\vec x) = \frac{\vec v \cdot \vec x}{\|\vec v\|^2} \vec v,\ \text{such that}\ \vec v \in L\\ &= \left( \vec{v_1} \cdot \vec{v_1} \right) \left( \vec{v_2}\cdot\vec{v_2} - 2\vec{v_2}\cdot\left(\frac{\vec{v_1}\cdot\vec{v_2}}{\|\vec{v_1}\|^2} \vec{v_1}\right) + \left( \left( \frac{\vec{v_1}\cdot\vec{v_2}}{\|\vec{v_1}\|^2} \right)^2 \left( \vec{v_1} \cdot \vec{v_1} \right) \right) \right) \blacktriangleleft\ \text{ dot product is associative} \\ &= \left( \vec{v_1} \cdot \vec{v_1} \right) \left( \vec{v_2}\cdot\vec{v_2} - 2\vec{v_2}\cdot\left(\frac{\vec{v_1}\cdot\vec{v_2}}{\vec{v_1} \cdot \vec{v_1}} \vec{v_1}\right) + \frac{(\vec{v_1}\cdot\vec{v_2})^2}{\vec{v_1} \cdot \vec{v_1}} \right)\\ &= (\vec{v_1}\cdot\vec{v_1})(\vec{v_2}\cdot\vec{v_2}) - \left(2\vec{v_2}\cdot\left(\frac{\vec{v_1}\cdot\vec{v_2}}{\vec{v_1} \cdot \vec{v_1}} \vec{v_1}\right)\right)(\vec{v_1}\cdot\vec{v_1}) + \left(\frac{(\vec{v_1}\cdot\vec{v_2})^2}{\vec{v_1} \cdot \vec{v_1}}\right)(\vec{v_1}\cdot\vec{v_1}) \\ &= (\vec{v_1}\cdot\vec{v_1})(\vec{v_2}\cdot\vec{v_2}) - 2\bigl((\vec{v_1} \cdot \vec{v_2})\cdot(\vec{v_1} \cdot \vec{v_2})\bigr) + (\vec{v_1}\cdot\vec{v_2})^2 \\ &= (\vec{v_1}\cdot\vec{v_1})(\vec{v_2}\cdot\vec{v_2}) - (\vec{v_1} \cdot \vec{v_2})\cdot(\vec{v_1} \cdot \vec{v_2}) \end{align} Now we have to solve the dot products, which means we will have to define our two vectors like so: \text{Let}\ \vec{v_1} = \begin{bmatrix} a\\b\end{bmatrix} \text{ and}\ \vec{v_2} = \begin{bmatrix} c\\d\end{bmatrix} Substituting these vector definitions into the above we can expand out the dot products: \begin{align} a^2 &= (\vec{v_1}\cdot\vec{v_1})(\vec{v_2}\cdot\vec{v_2}) - (\vec{v_1} \cdot \vec{v_2})\cdot(\vec{v_1} \cdot \vec{v_2}) \\ &= (a^2 + b^2)(c^2 + d^2) - (ac + bd)(ac + bd) \\ &= a^2c^2 + a^2d^2 + b^2c^2 + b^2d^2 - (a^2c^2 + 2abcd + b^2d^2) \\ &= a^2d^2 + b^2c^2 - 2abcd \\ &= (ad - bc)(ad - bc)\\ &= |A|^2\\ \therefore a &= |A| \end{align}
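A quick numerical check of this result, picking two arbitrary vectors and comparing the base-times-height area with the determinant:

```python
import numpy as np

v1 = np.array([3.0, 1.0])
v2 = np.array([1.0, 2.0])

# Area via base * height, using the projection of v2 onto v1.
proj = (np.dot(v1, v2) / np.dot(v1, v1)) * v1
height = np.linalg.norm(v2 - proj)
area = np.linalg.norm(v1) * height

# Area via the determinant of the matrix whose columns are v1 and v2.
det = np.linalg.det(np.column_stack([v1, v2]))

print(area, det)                    # 5.0 and 5.0
print(np.isclose(area, abs(det)))   # True (the absolute value covers a negative determinant)
```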

Now, in the section on linear transforms, we will see that the column vectors of the transform matrix show what happens to the canonical basis vectors. Thus, if the determinant of the transformation matrix is the area of the parallelogram formed by the column vectors of the matrix, the determinant gives us the scaling factor of the transformation!

Rowspace & Left Nullspace & Orthogonal Complements

Left Null Space Definition

Notes based on Khan Academy with extra reference to MIT Linear Algebra eBook Chapter.

We saw in the 3rd null space example the following matrix, and its reduction to RREF: \text{Let } A = \begin{bmatrix} 2 & -1 & -3 \\ -4 & 2 & 6 \end{bmatrix} We calculate the null space by setting up an augmented matrix to solve for A\vec x = \vec 0 : \left[\begin{array}{ccc|c} 2 & -1 & -3 & 0\\ -4 & 2 & 6 & 0 \end{array}\right] \mapsto \left[\begin{array}{ccc|c} 1 & -1/2 & -3/2 & 0 \\ 0 & 0 & 0 & 0 \end{array}\right] Giving... \mathrm{N}(A) = \mathrm{span}\left(\begin{bmatrix}\frac{1}{2}\\ 1\\ 0 \end{bmatrix}, \begin{bmatrix}\frac{3}{2}\\ 0\\ 1\end{bmatrix}\right) But, we also know that because only the first column in the RREF form has a pivot entry, the basis for the column space of the matrix A only contains the first column vector: \mathrm{C}(A) = \mathrm{span}\left(\begin{bmatrix}2 \\ -4\end{bmatrix}\right) We also know the rank of the matrix: \mathrm{rank}(A) = \mathrm{dim}(C_{\text{basis}}(A)) = 1 Now we take the transpose of A : A^T = \left[\begin{array}{cc|c} 2 & -4 & 0\\ -1 & 2 & 0\\ -3 & 6 & 0 \end{array}\right] \mapsto \text{ rref } \mapsto \left[\begin{array}{cc|c} 1 & -2 & 0\\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{array}\right] So, now we can calculate the nullspace of the transpose of A. We see that x_1 = 2x_2 and that x_2 is a free variable so: \mathrm{N}(A^T) = \mathrm{span}\left(\begin{bmatrix}2 \\ 1\end{bmatrix}\right) The \mathrm{N}(A^T) is called the left null space of A . Why have we called in the left null space? Have a look at this... \mathrm{N}(A^T) = \left\{\vec x \mid A^T\vec x = \vec 0\right\} ...by the very definition of what a null space is. We are solving for A^T\vec x = \vec 0 : \begin{align} (A^T\vec x) &= \vec 0 \\ \therefore (A^T\vec x)^T &= \vec 0 ^ T \ \blacktriangleleft \text{ transpose both sides}\\ \vec x^T(A^T)^T &= \vec 0 ^ T \\ \vec x^T A &= \vec 0 ^ T \end{align} Thus, the null space of A^T can also be defined as so: \mathrm{N}(A^T) = \left\{\vec x \mid \vec x^TA = \vec 0^T\right\} Now we see why it is called the "left" null space: the vector \vec x^T is on the LHS of the transformation matrix!

Rowspace

The rowspace of the transformation A is defined as \mathrm{C}(A^T) : \mathrm{Rowspace}(A) = \mathrm{C}(A^T) Recall from above: A^T = \left[\begin{array}{cc|c} 2 & -4 & 0\\ -1 & 2 & 0\\ -3 & 6 & 0 \end{array}\right] \mapsto \text{ rref } \mapsto \left[\begin{array}{cc|c} 1 & -2 & 0\\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{array}\right] The only pivot column is the first, so: \mathrm{Rowspace}(A) = \mathrm{C}(A^T) = \mathrm{span}\left(\begin{bmatrix}2 \\ -1 \\ -3\end{bmatrix}\right)
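The same calculations can be reproduced with sympy (the use of sympy here is my own sketch; the lectures do the reductions by hand):

```python
from sympy import Matrix

A = Matrix([[ 2, -1, -3],
            [-4,  2,  6]])

print(A.nullspace())      # N(A): basis vectors [1/2, 1, 0] and [3/2, 0, 1]
print(A.T.nullspace())    # N(A^T), the left null space: basis vector [2, 1]

# The rowspace C(A^T): the columns of A^T that correspond to pivot columns of RREF(A^T).
rref_AT, pivots = A.T.rref()
print([A.T.col(i) for i in pivots])   # [Matrix([2, -1, -3])]
```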

Relationship Between The Spaces: Orthogonal Complements

An orthogonal complement is defined as follows. If V is a subspace of \mathbb{R}^n , the orthogonal complement of V is written V^\perp and is defined as so: V^\perp = \{\vec x \in \mathbb{R}^n \mid \vec x \cdot \vec v = 0\ \forall \vec v \in V\} In other words, every member of V^\perp is orthogonal to every member of V .

It can be shown that V^\perp is a valid subspace: it is closed under addition and scalar multiplication.

To find the orthogonal complement of V , take its generating vectors and place them as row vectors in a matrix. If V \in \mathbb{R}^n and the generating set of vectors, or span, of V is \{\vec{v_1}, \vec{v_2}, \ldots, \vec{v_k}\} then form the k by n matrix: A = \begin{bmatrix} \leftarrow & \vec{v_1} & \rightarrow \\ \leftarrow & \vec{v_2} & \rightarrow \\ & \vdots \\ \leftarrow & \vec{v_k} & \rightarrow \end{bmatrix} In the above matrix, V becomes the row space of A . Because the null space of A is the orthogonal complement of the row space of A , the orthogonal complement of your subspace V is found by finding the null space of the above matrix!
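A small sympy sketch of this recipe, for a made-up V spanned by two vectors in \mathbb{R}^3 :

```python
from sympy import Matrix

# Generating vectors of V placed as the rows of A (so V is the row space of A).
v1 = [1, 0, 1]
v2 = [0, 1, 1]
A = Matrix([v1, v2])

# V-perp is the null space of A.
perp = A.nullspace()
print(perp)   # one basis vector: [-1, -1, 1]

# Every basis vector of V-perp is orthogonal to every generator of V.
for w in perp:
    print(Matrix([v1]).dot(w), Matrix([v2]).dot(w))   # 0 0
```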

The column space and left-null space are perpendicular. They are orthogonal complements. Thus we can say:

\begin{align} \mathrm{C}(A) &= \left(\mathrm{N}(A^T)\right)^\perp\ \text{ and}\\ \mathrm{N}(A^T) &= \left(\mathrm{C}(A)\right)^\perp \end{align}

The null space and row space are perpendicular. They are orthogonal complements. Thus we can say:

\begin{align} \mathrm{N}(A) &= \left(\mathrm{C}(A^T)\right)^\perp\ \text{ and}\\ \mathrm{C}(A^T) &= \left(\mathrm{N}(A)\right)^\perp \end{align}
We can show that \mathrm{C}(A^T) = \left(\mathrm{N}(A)\right)^\perp . We want to show that if \vec x \in \mathrm{N}(A) then \vec x \cdot \vec a = 0 for all \vec a \in \mathrm{C}(A^T) . We know that \mathrm{N}(A) = \{\vec x \in \mathbb{R}^\mathbf{\color{blue}{n}} \mid A\vec x = \vec 0\} which we can write as: \underbrace{ \begin{bmatrix} \leftarrow r_1^T \rightarrow \\ \leftarrow r_2^T \rightarrow \\ \vdots \\ \leftarrow r_m^T \rightarrow \end{bmatrix} }_{m \times \mathbf{\color{blue}{n}}} \vec x = \underbrace{ \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} }_{m \times 1} Where r_i^T is the ith row vector of A .

We know that, by the definition of matrix multiplication, we can say the following: \vec r_1^T \vec x = 0 \\ \vec r_2^T \vec x = 0 \\ \vdots \\ \vec r_m^T \vec x = 0 \\ The above is just the matrix multiplication shown above written out. But, we know that \vec r_i^T \vec x is the same as the dot product, so it is equal to \vec r_i \cdot \vec x . I.e., the row vectors of A must all be orthogonal to \vec x if \vec x is a member of the null space.

Thus, the vectors in the null space are \perp to the row vectors of A , i.e., to the column vectors of A^T . Thus, \mathrm{N}(A) \subseteq (\mathrm{C}(A^T))^\perp .

Conversely, any \vec x that is orthogonal to every row of A satisfies A\vec x = \vec 0 , so (\mathrm{C}(A^T))^\perp \subseteq \mathrm{N}(A) , and therefore \mathrm{N}(A) = \mathrm{C}(A^T)^\perp . If we then let B^T = A , this says \mathrm{N}(B^T) = \mathrm{C}(B)^\perp , i.e., the left null space is the orthogonal complement of the column space.

Relating The Dimension Of A Subspace To Its Orthogonal Complement

The dimensions of each subspace can also be related: \mathrm{dim}(V) + \mathrm{dim}(V^\perp) = n , where V is a subspace of \mathbb{R}^n . This, therefore, means that if \mathrm{dim}(V) = k , then \mathrm{dim}(V^\perp) = n - k .

Representing A Vector As A Combination Of A Vector In A Subspace And One In Its Orthogonal Complement

This relation implies that combining V and V^\perp gives all of \mathbb{R}^n : if V_b is a basis for V and V_b^\perp is a basis for V^\perp , then V_b \cup V_b^\perp is a basis for \mathbb{R}^n . This leads us to an important property: any vector in \mathbb{R}^n can be represented uniquely by some vector in a subspace V of \mathbb{R}^n plus some vector in V^\perp : \forall \vec a \in \mathbb{R}^n: \vec a = \vec v + \vec v_p \text{ where } \vec v \in V \text{ and } \vec v_p \in V^\perp

We can also notice something important about the intersection of a subspace and its orthogonal complement: if \vec x \in V and \vec x \in V^\perp then \vec x = \vec 0 . I.e., V \cap V^\perp = \{\vec 0\} .

Unique Rowspace Solution To A\vec x = \vec b

We have seen that any vector in \mathbb{R}^n can be represented as a combination of a vector in \mathrm{N}(A) and a vector in \mathrm{N}(A)^\perp = \mathrm{C}(A^T) .

So if \vec x is a solution to A\vec x = \vec b , where \vec x \in \mathbb{R}^n , then we can express \vec x as \vec x = \vec{r_0} + \vec{n_0} where \vec{r_0} \in \mathrm{C}(A^T) and \vec{n_0} \in \mathrm{N}(A) .

I.e., the solution is a linear combination of some vector in the rowspace of A and some vector in the null space of A .

\begin{align} \vec x &= \vec{r_0} + \vec{n_0} \text{ where } \vec{r_0} \in \mathrm{C}(A^T), \vec{n_0} \in \mathrm{N}(A) \\ \therefore A\vec x &= A(\vec{r_0} + \vec{n_0}) \\ &= A\vec{r_0} + A\vec{n_0} \\ &= A\vec{r_0}\ \text{ because, by defn } A\vec{n_0} = \vec 0 \\ &= \vec b \end{align} Thus, A\vec{r_0} = \vec b , which means that \vec{r_0} is a solution!

We have also seen, in the section on the column space that the solution, if it exists, must be in the column space of A . Therefore, for any vector, \vec b , in the column space of A , there is a unique vector, \vec{r_0} in the rowspace of A that is the solution. Woah!!

It can also be shown that this is a unique solution. Proof omitted.

To summarise: any solution to A\vec x = \vec b can be written as \vec x = \vec{r_0} + \vec{n_0} , and we know there is a unique solution in the rowspace of A , \mathrm{C}(A^T) .

This means that we can also write: \|\vec x\|^2 = \|\vec{r_0}\|^2 + \|\vec{n_0}\|^2 \ge \|\vec{r_0}\|^2 \therefore \|\vec x\|^2 \ge \|\vec{r_0}\|^2 \ge 0

This is a very important point, as we'll see later on when this gets applied to things like least square estimation. If \mathbf{\vec b \in \mathrm{C}(A)} , then \mathbf{\exists!\vec{r_0} \in \mathrm{C}(A^T)} such that \mathbf{\vec{r_0}} is a solution to \mathbf{A\vec x = \vec b} and no other solution has a length less than \mathbf{\|\vec{r_0}\|} .

Summary

The orthogonal complement V^\perp of a subspace V \in \mathbb{R}^n has the following properties:

  • V^\perp is a subspace of \mathbb{R}^n .
  • \mathrm{dim}(V^\perp) = n - \mathrm{dim}(V) .
  • (V^\perp)^\perp = V .
  • Each vector \vec b in \mathbb{R}^n can be expressed uniquely in the form \vec b = \vec{b_V} + \vec{b_{V^\perp}} for \vec{b_V} \in V and \vec {b_{V^\perp}} \in V^\perp .
  • If {\vec b \in \mathrm{C}(A)} , then {\exists!\vec{r_0} \in \mathrm{C}(A^T)} such that {\vec{r_0}} is a solution to {A\vec x = \vec b} and no other solution has a length less than {\|\vec{r_0}\|}

A Visual Exploration

\text{Let } A = \begin{bmatrix}3 & -2\\ 6 & -4\end{bmatrix}, \text{ and } \vec b = \begin{bmatrix}9 \\18\end{bmatrix} We know that the nullspace of A is defined as \mathrm{N}(A) = \{\vec x \in \mathbb{R}^2 \mid A\vec x = \vec 0\} and that \mathrm{N}(A) = \mathrm{N}(\mathrm{RREF}(A)) , so we can determine the null space by reducing [A | 0] to its RREF as so: \left[\begin{array}{cc|c} 3 & -2 & 0\\ 6 & -4 & 0\end{array}\right] \mapsto \left[\begin{array}{cc|c}1 & -\frac{2}{3} & 0\\ 0 & 0 & 0\end{array}\right] \therefore x_1 = \frac{2}{3}x_2 \therefore \mathrm{N}(A) = \left\{x_2 \begin{bmatrix}\frac{2}{3} \\1\end{bmatrix},\ x_2 \in \mathbb{R}\right\} = \mathrm{span}\left(\begin{bmatrix}\frac{2}{3} \\1\end{bmatrix}\right) Now the rowspace, aka \mathrm{C}(A^T) : A^T = \begin{bmatrix}3 & 6\\ -2 & -4\end{bmatrix} \mapsto \begin{bmatrix}1 & 2\\ 0 & 0\end{bmatrix} We have seen that the column space of a matrix will be the column vectors associated with the pivot columns. Therefore: \mathrm{RowSpace}(A) = \mathrm{C}(A^T) = \mathrm{span}\left(\begin{bmatrix}3 \\ -2\end{bmatrix}\right)

So, what about the solution set? \left[\begin{array}{cc|c} 3 & -2 & 9\\ 6 & -4 & 18\end{array}\right] \mapsto \left[\begin{array}{cc|c}1 & -\frac{2}{3} & 3\\ 0 & 0 & 0\end{array}\right] \therefore x_1 = \frac{2}{3}x_2 + 3 \therefore\text{ Solution Set } = \left\{ x_2 \color{red}{ \underbrace{\begin{bmatrix}\frac{2}{3} \\1\end{bmatrix}}_{\vec{n_0} \in \mathrm{N}(A)} } + \color{green}{ \begin{bmatrix}3 \\0\end{bmatrix} },\ x_2 \in \mathbb{R} \right\} We can see that the solution set is a scaled vector from the null space plus some particular solution. These can be plotted as so:



We can also see that, as we discovered previously, the shortest solution to the problem is the vector formed by the null-space vector scaled to zero plus a vector in the rowspace:


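As a numerical aside (not from the lectures): numpy's pseudo-inverse returns exactly this shortest, rowspace solution, and we can check it for the matrix above:

```python
import numpy as np

A = np.array([[3.0, -2.0],
              [6.0, -4.0]])
b = np.array([9.0, 18.0])

# The pseudo-inverse gives the minimum-norm solution, which is r0 in C(A^T).
r0 = np.linalg.pinv(A) @ b
print(r0)                                         # [ 2.0769... -1.3846...] = [27/13, -18/13]
print(np.allclose(A @ r0, b))                     # True: it really is a solution
print(np.isclose(r0[0] * -2.0 - r0[1] * 3.0, 0))  # True: r0 is parallel to the rowspace basis vector [3, -2]
```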

Eigenvectors and Eigenvalues

Stackoverflow thread: What is the importance of eigenvalues and eigenvectors.

another

Eigenvectors and values exist in pairs: every eigenvector has a corresponding eigenvalue. An eigenvector is a direction, ... An eigenvalue is a number, telling you how much variance there is in the data in that direction ...
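For completeness, here is what an eigenvector/eigenvalue pair looks like computationally (the 2x2 matrix is an arbitrary example; numpy does the work here):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)   # columns of `eigenvectors` are the eigenvectors
print(eigenvalues)                             # e.g. [3. 1.] (order not guaranteed)

# For each pair, A v = lambda v: the direction v is only scaled, never rotated.
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))         # True
```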

Linear Transformations

Functions

Scalar Functions

Normally a function is a mapping between two sets: from a domain to a range, which is a subset of the function's co-domain. I.e., the co-domain is the set of all possible things the function could map onto and the range is the subset of the co-domain containing the things the function actually maps to. If we have two sets, X and Y , we say that if f is a function, then it associates elements of X with elements of Y . I.e., f is a mapping between X and Y , and we write f: X \mapsto Y .

co-domain vs range

The usual function notation looks like this: f(x) = x^2 . Note that f is the function, not f(x) , which is the value output by the function, f , when x is input. There are other notations we use to denote functions.

For example the function f(x) = x^2 can be described as a mapping from the set of all real numbers to the set of all real numbers. f: \mathbb{R} \rightarrow \mathbb{R} It can also be defined using this notation: f: x \mapsto x^2

A function is not, however, a simple mapping between two sets. For a mapping to be a function the following must be true: f(x) = y_1 \cap f(x) = y_2 \implies y_1 = y_2 . In other words one item in the domain cannot map to more than one item in the range, but two distinct items in the domain can map to the same item in the range.

a function vs a mapping

Vector Valued Functions

Functions that map to \mathbb{R} are called scalar valued functions, or specifically in this case, real valued functions.

If the co-domain of a function is in \mathbb{R}^m , where m > 1 , then we say that the function is vector valued. Functions that map from \mathbb{R}^n to \mathbb{R}^m are also called transformations.

A vector valued function could look something like this, for example: f\left( \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \right) = \begin{bmatrix} 2\pi x_1 + \frac{5}{2}x_3\\ 4x_2 \end{bmatrix} It could look like anything... the point is, because f operates on a vector, we call it a transformation.

Surjective (Onto) Functions

A surjective, or onto, function provides a mapping from the domain to every element of the co-domain. I.e., the range is the co-domain, or the image of the function is the co-domain. If f: X \mapsto Y and f is surjective, then \mathrm{Im}(f) = Y . \forall y \in Y, \exists \text{ at least one } x \mid f(x) = y TODO: Insert diagram

Injective (One-To-One) Functions

An injective, or one-to-one, function is one where no two elements of the domain map to the same value in the co-domain, i.e., each element of the co-domain is mapped to by at most one element of the domain. Note that not every element of the co-domain needs to be mapped to. \forall y \in Y, \exists \text{ at most one } x \mid f(x) = y TODO: Insert diagram

Bijective (One-To-One And Onto)

A function is bijective (one-to-one and onto) when it is both surjective (onto) and injective (one-to-one). \forall y \in Y, \exists \text{ exactly one } x \mid f(x) = y TODO: Insert diagram

Invertible Functions

A function \mathrm{f}: X \mapsto Y is invertible if there exists another function \mathrm{f^{-1}}: Y \mapsto X such that \mathrm{f^{-1}}(\mathrm{f}(x)) = \mathrm{I_x}(x) = x .

In the above \mathrm{I_x}(x) is the identity function: \mathrm{I_x}: X \mapsto X where \forall a \in X, \mathrm{I_x}(a) = a .

If a function \mathrm{f} is invertible then the function \mathrm{f^{-1}} is unique and there is only one unique solution to \mathrm{f}(x) = y .

To summarise, the function \mathrm{f} is invertible iff, \exists \mathrm{f^{-1}}: Y \mapsto X\ \text{such that}\ \mathrm{f^{-1}} \circ \mathrm{f} = \mathrm{I_x} Which means that there is only one solution to \mathrm{f}(x) = y , so we can also say: \mathrm{f}: X \mapsto Y \ \text{is invertible}\ \Leftrightarrow \forall y \in Y, \exists! x \in X \mid \mathrm{f}(x) = y The part of the expression \forall y \in Y implies that the range of the function is the co-domain, which means that the function is surjective (onto), and the part of the expression \exists! x \in X \mid \mathrm{f}(x) = y means that the function is also injective (one-to-one), which thus means that the function is bijective (one-to-one and onto).

Thus to be invertible a function must be bijective (one-to-one and onto)!!

Linear Transformations

Vector valued functions are transformations. Linear transformations are a particular type of transformation that obey the following rules. If T: \mathbb{R}^n \mapsto \mathbb{R}^m and \vec a, \vec b \in \mathbb{R}^n then:

  1. T(\vec a + \vec b) = T(\vec a) + T(\vec b)
  2. T(\alpha \vec a) = \alpha T(\vec a) \mid \alpha \in \mathbb{R}
  3. T(\vec 0) = \vec 0

In other words, the origin must remain fixed and all lines must remain lines. For example, a transformation such as \mathrm{f}(x) = x^2 is non-linear because, in general, \mathrm{f}(a + b) \ne \mathrm{f}(a) + \mathrm{f}(b) .

Matrices are used to do the linear transformations, so the function T will generally be defined as T(\vec v) = A\vec v , where A is the transformation matrix. 2x2 matrices transform 2D space, 3x3 matrices transform 3D space, and so on...

Interestingly the column vectors of the transformation matrix A tell us what happens to the canonical basis vectors in that space, and from this we can work out how any vector would be transformed.

Graph showing the linear transformation of the canonical vectors for R^2

We have seen that in cartesian space, \mathbb{R}^2 , for any coordinate, (x, y) , the numbers x and y are the coefficients of the canonical basis vectors... \begin{bmatrix} x \\ y \end{bmatrix} = x \begin{bmatrix} 1 \\ 0 \end{bmatrix} + y \begin{bmatrix} 0 \\ 1 \end{bmatrix} Lets say our transformation matrix has the following properties: T : \begin{bmatrix} 1 \\ 0 \end{bmatrix} \mapsto \begin{bmatrix} a \\ b \end{bmatrix} \\ T : \begin{bmatrix} 0 \\ 1 \end{bmatrix} \mapsto \begin{bmatrix} c \\ d \end{bmatrix} Well, we've seen we can re-write any vector as a combination of the above basis vectors, so we can write: \begin{align} T\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) &\mapsto T\left(x \begin{bmatrix} 1 \\ 0 \end{bmatrix} + y \begin{bmatrix} 0 \\ 1 \end{bmatrix}\right)\\ & \mapsto xT\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) + yT\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) \\ & \mapsto \begin{bmatrix} \color{red}ax + \color{blue}by \\ \color{red}cx + \color{blue}dy \end{bmatrix} \end{align} Which looks suspiciously like a matrix, vector product. If we describe the transformation, T as follows: T = \begin{bmatrix} \color{red}a & \color{blue}b \\ \color{red}c & \color{blue}d \end{bmatrix} Then... T\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \begin{bmatrix} \color{red}a & \color{blue}b \\ \color{red}c & \color{blue}d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \color{red}ax + \color{blue}by \\ \color{red}cx + \color{blue}dy \end{bmatrix} An so we can see that the first column vector of T shows us where the first canonical basis vector [1, 0] maps to and the second column vector of T shows us where the second canonical basis vector [0, 1] maps to... woop! In other words, just to be clear... T = \begin{bmatrix} \color{red}a & \color{blue}b \\ \color{red}c & \color{blue}d \end{bmatrix} = \begin{bmatrix} T\left( \begin{bmatrix} 1 \\ 0 \end{bmatrix} \right) & T\left( \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right) \end{bmatrix} We can also note that this the same as taking the transform of each of the column vectors of the identity matrix!
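A tiny numerical illustration of this (the 2x2 transform is made up): the columns of T are exactly where the canonical basis vectors land, and applying T to the identity matrix just returns T.

```python
import numpy as np

T = np.array([[1.0, -2.0],
              [3.0,  0.5]])

e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

print(T @ e1)   # [1. 3.]   -> the first column of T
print(T @ e2)   # [-2. 0.5] -> the second column of T

# Any vector is transformed as x * T(e1) + y * T(e2).
x, y = 2.0, -1.0
print(np.allclose(T @ np.array([x, y]), x * (T @ e1) + y * (T @ e2)))   # True

# Applying T to the column vectors of the identity matrix reproduces T itself.
print(np.allclose(T @ np.eye(2), T))   # True
```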

We can see this more "solidly" in the sketch below...

Graph showing the linear transformation of the canonical vectors for R^2

We can see how one coordinate can be represented using the canonical Cartesian coordinates, but also coordinates in our transformed space. Thus the same "point" can be represented using different coordinates (coefficients of the basis vectors) for different transform spaces.

Another important point is that it can be shown that all matrix-vector products are linear transformations!

Linear transformations also have these important properties, if we take two transformations S: \mathbb{R}^n \mapsto \mathbb{R}^m and T: \mathbb{R}^n \mapsto \mathbb{R}^m , where S has the transformation matrix A and T has the transformation matrix B :

  • Transformation addition: (S+T)(\vec x) = S(\vec x) + T(\vec x) = (A+B)\vec x .
  • Scalar multiplication: (cS)(\vec x) = cS(\vec x) = (cA)\vec x .

Graph showing a linear transformation doing rotations

Inverse Linear Transformations

We saw in the section on functions that an invertible function has to be bijective (one-to-one and onto). As a transform is just a vector valued function, the same condition for invertibility exists. Recall that: \mathrm{f}: X \mapsto Y \ \text{is invertible}\ \Leftrightarrow \color{red}{\forall y \in Y}, \color{blue}{\exists! x \in X} \mid \mathrm{f}(x) = y Where the red indicates that the range is the co-domain (the mapping is surjective or onto) and the blue indicates that the mapping is unique (injective or one-to-one). We can also write this as follows: T \ \text{is invertible} \Leftrightarrow \exists T^{-1} \mid T^{-1} \circ T = I \cap T \circ T^{-1} = I

It can be shown that invertibility implies a unique solution to f(x)=y .

We want to understand under what conditions our standard linear transform, T: \mathbb{R}^n \rightarrow \mathbb{R}^m,\ \mathrm{T}(\vec x) = A \vec x , is invertible, i.e., when is it both injective (one-to-one) and surjective (onto)?

Surjective (Onto)

Start with surjective (onto)... we are looking at \mathrm{T}(\vec x) = \vec b , i.e., A\vec x = \vec b ...

We can write A\vec x as its column vectors multiplied by the elements of \vec x : A\vec x = \vec{a_1}x_1 + \vec{a_2}x_2 + \cdots + \vec{a_n}x_n Remember that a surjective, or onto, function provides a mapping from the domain to every element of the co-domain. Thus, for our transform to be surjective (onto), the set of the above linear combinations must span \mathbb{R}^m .

This means that for T to be surjective (onto) the column vectors of A have to span \mathbb{R}^m , i.e., the column space of A must span \mathbb{R}^m (look at the equation above to see why). I.e., \mathrm{C}(A) = \mathbb{R}^m . Thus, the challenge becomes finding this out, because if it is true we are half way there to showing that the transform is invertible.

For the column space to span \mathbb{R}^m (i.e., \mathrm{C}(A) = \mathbb{R}^m ) we need A\vec x = \vec b to have at least one solution for every \vec b \in \mathbb{R}^m . We saw in the section on RREF that when an augmented matrix is reduced to RREF we can determine the number of solutions by looking at where the pivots are.

To recap, when a matrix is in RREF:

  • Free variables mean infinitely many solutions.
  • No free variables (all columns have a pivot) means a unique solution.
  • A row of zeros in A , with a pivot (or a non-zero function of \vec b ) in the augmented column, means no solution for that \vec b , and \therefore \mathrm{C}(A) \ne \mathbb{R}^m .

Thus, to show that a transform is surjective (onto) we need A\vec x = \vec b to have at least one solution for every \vec b - i.e., either of the first two cases on the list above, and never the third.

That is, there must be a pivot in every row of our m\times n matrix A , so to be onto we can say \mathrm{rank}(A) = m .

Injective (One-To-One)

For invertibility, onto is not enough. We need one-to-one as well! It can be shown that the solution set to A\vec x = \vec b is the set of shifted versions of the null space. I.e., assuming that A\vec x = \vec b has a solution \vec{x_p} , the solution set is \{\vec{x_p} + \vec n \mid \vec n \in \mathrm{N}(A)\} [Ref].

This means that for A to be one-to-one we require the null space of A to contain only the zero vector. I.e., for a transformation matrix to be one-to-one, its null space must be trivial.

If the null space of A is trivial, then \mathrm{N}(A) = \{\vec 0\} \rightarrow \ \text{column vectors of A are LI} . Because the column vectors are LI, they form a basis of n vectors, so \mathrm{dim}(\mathrm{C}(A)) = n \rightarrow \mathrm{rank}(A) = n .

Thus, the transform A is one-to-one if and only if \mathrm{rank}(A) = n .

Invertible: Injective (One-To-One) and Surjective (Onto)

We saw that to be surjective, \mathrm{rank}(A) = m , and to be injective, \mathrm{rank}(A) = n . Therefore, to be invertible, A must be a square matrix. As its rank is n , this also means that every column must contain a pivot, and because all the columns form a basis, i.e., are LI, if we reduce A to RREF, it must be the identity matrix, I . Thus, \mathrm{RREF}(A) = I must also be true.

The inverse mapping is also linear. To recap, linearity implies: \mathrm{T}(\vec x + \vec y) = \mathrm{T}(\vec x) + \mathrm{T}(\vec y) and \mathrm{T}(c\vec x) = c\mathrm{T}(\vec x)

Finding The Inverse

To find the inverse of a matrix we augment it with the identity matrix of the same size and reduce it to RREF.

Let... A = \left[ \begin{array}{cc} 2 & 2 \\ 0 & 5\\ \end{array} \right] Augment A with I to form [A \mid I] and reduce: \left[ \begin{array}{cc|cc} 2 & 2 & 1 & 0 \\ 0 & 5 & 0 & 1 \\ \end{array} \right] \rightarrow rref \rightarrow \left[ \begin{array}{cc|cc} 1 & 0 & 1/2 & -1/5\\ 0 & 1 & 0 & 1/5 \\ \end{array} \right] Thus, A^{-1} = \left[ \begin{array}{cc} 1/2 & -1/5\\ 0 & 1/5 \\ \end{array} \right] And we see that it is indeed the inverse because: AA^{-1} = \left[ \begin{array}{cc} 2 & 2 \\ 0 & 5 \\ \end{array} \right] \left[ \begin{array}{cc} 1/2 & -1/5\\ 0 & 1/5 \\ \end{array} \right] = \left[ \begin{array}{cc} 1 & 0\\ 0 & 1 \\ \end{array} \right] And: A^{-1}A = \left[ \begin{array}{cc} 1/2 & -1/5\\ 0 & 1/5 \\ \end{array} \right]\left[ \begin{array}{cc} 2 & 2 \\ 0 & 5 \\ \end{array} \right] = \left[ \begin{array}{cc} 1 & 0\\ 0 & 1 \\ \end{array} \right] As we've seen, if A isn't square then don't bother - it doesn't have an inverse. If it is square but its RREF is not the identity matrix, then it is singular and does not have an inverse.
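The same calculation can be reproduced with sympy (a sketch reproducing the example above, not part of the lectures):

```python
from sympy import Matrix, eye

A = Matrix([[2, 2],
            [0, 5]])

# Augment with the identity and reduce to RREF: [A | I] -> [I | A^-1].
augmented = A.row_join(eye(2))
rref_form, _ = augmented.rref()
A_inv = rref_form[:, 2:]

print(A_inv)             # Matrix([[1/2, -1/5], [0, 1/5]])
print(A * A_inv)         # the identity matrix
print(A_inv == A.inv())  # True
```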

Projections

The graph below shows a vector \vec x and its projection onto the line L , defined by the vector \vec v . The projection of \vec x onto L describes how much of \vec x goes in the direction of L .

Vector projections

We write "the projection of \vec x onto L like so: \textrm{Proj}_L(\vec x) The projection is found by dropping a line, perpendicular to L from the top of \vec x , down to L . Where this line hits L is the end of the vector that describes the projection. This is shown on the graph by the smaller green vector labelled \textrm{Proj}_L , and the purple line labelled \vec x - \textrm{Proj}_L .

Because \vec x - \textrm{Proj}_L(\vec x) is perpendicular to L we can say the following (recall perpendicular vector's dot product is zero): (\vec x - \textrm{Proj}_L(\vec x)) \cdot \vec v = 0,\ \text{where}\ \vec v\ \text{is a defining vector of the line L} \therefore (\vec x - c\vec v) \cdot \vec v = 0, \text{because}\ c\vec v \in L\ \text{and}\ \textrm{Proj}_L(\vec x) \in L\ \text{it means, for some}\ c,\ c\vec v = \textrm{Proj}_L(\vec x) \therefore c = \frac{\vec v \cdot \vec x}{\|\vec v\|^2} Thus the projection is defined as: \textrm{Proj}_L(\vec x) = \frac{\vec v \cdot \vec x}{\|\vec v\|^2} \vec v,\ \text{such that}\ \vec v \in L And if \vec v is a unit vector \hat v , then this simplifies to: \textrm{Proj}_L(\vec x) = (\hat v \cdot \vec x) \hat v And we have seen in a previous section how to make a vector a unit vector. Whoopdi do!

An important point to note - which wasn't immediately obvious to me, but probably should've been in hindsight - is that with this projection-onto-a-line, the line projected onto has to pass through the origin! This is seen in better detail later in the second least squares example.

However, I realised this is fine because when we're talking about subspaces they must contain the zero vector, because they are closed under scalar multiplication, so when projecting onto a subspace, the line, plane, or whatever, must pass through the origin! It is only when one starts thinking about graphs or graphics that we need to consider the case where the line doesn't pass through the origin, so this is a case that does not bother us here :)

Turns out that projection is a linear transform (see the rules)! (But the line must pass through the origin!).

To calculate the matrix that describes the projection transform, apply the projection to each of the column vectors of the identity matrix, where the column vectors of I have the same dimension as the vectors being projected. So, for example, for vectors in \mathbb{R}^n , to get the transformation matrix A for a projection onto a line defined by the vector \hat u : A = \begin{bmatrix} \textrm{Proj}_L(I_{\text{col 1}}) & \textrm{Proj}_L(I_{\text{col 2}}) & \cdots & \textrm{Proj}_L(I_{\text{col n}}) \end{bmatrix}
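A small numpy sketch of this recipe, projecting onto the line through the origin in the direction of an arbitrary vector v:

```python
import numpy as np

def proj_onto_line(v, x):
    """Projection of x onto the line through the origin spanned by v."""
    return (np.dot(v, x) / np.dot(v, v)) * v

v = np.array([2.0, 1.0])   # direction of the line L

# Build the transformation matrix column by column from the identity matrix.
I = np.eye(2)
A = np.column_stack([proj_onto_line(v, I[:, j]) for j in range(2)])
print(A)
# [[0.8 0.4]
#  [0.4 0.2]]

# The matrix now projects any vector in a single multiplication.
x = np.array([3.0, 3.0])
print(np.allclose(A @ x, proj_onto_line(v, x)))   # True
```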

This definition has been given only for a projection onto a line, which is a valid subspace (recall: closed under addition and scalar multiplication). But what about projections onto planes in 3D or onto any arbitrary "space" in \mathbb{R}^n ? It does, in fact, generalise to a projection onto any subspace!

If V is a subspace of \mathbb{R}^n , then we have seen that V^\perp is also a valid subspace of \mathbb{R}^n . We have also seen that we can express any vector in \mathbb{R}^n as the sum of a vector in V and a vector in V^\perp .

I.e., if \vec{x} \in \mathbb{R}^n , \vec v \in V and \vec w \in V^\perp , then we can express \vec x uniquely as \vec x = \vec v + \vec w , as we have seen in a previous section. Thus we can say: \vec x = \underbrace{\vec v}_{\text{Part of } \vec x \text{ from V}} + \underbrace{\vec w}_{\text{Part of } \vec x \text{ from } V^\perp} Therefore: \begin{align} \mathrm{Proj}_V(\vec x) &= \text{Part of } \vec x \text{ from } V\\ &= \vec v \in V \end{align} And: \begin{align} \mathrm{Proj}_{V^\perp}(\vec x) &= \text{Part of } \vec x \text{ from } V^\perp\\ &= \vec w \in V^\perp \end{align} We know that a projection is a linear transformation and is therefore defined by some A\vec x = \vec b , where \vec x is the vector we are projecting, and \vec b is the projection of that vector. But, projected onto what? We can see this in the diagram below, which replicates the previous problem's matrix A :

We have projected \vec x onto the row space of our projection matrix A , and the projection is the unique solution from the rowspace. So the row space is, in 2D world, the line we have projected \vec x onto, and in any-D world, the subspace we have projected \vec x onto. We can say: \mathrm{proj}_{\mathrm{C}(A^T)}(\vec s) = \vec r In the above problem, the easiest solution was the vector [3, 0] , so we have: \begin{align} \mathrm{proj}_{\mathrm{C}(A^T)}(\vec s) &= \frac{ \begin{bmatrix} 3 \\ 0\end{bmatrix} \cdot \begin{bmatrix} 3 \\ -2\end{bmatrix} } { \begin{bmatrix} 3 \\ -2\end{bmatrix} \cdot \begin{bmatrix} 3 \\ -2\end{bmatrix} } \begin{bmatrix} 3 \\ -2\end{bmatrix} \\ &= \begin{bmatrix} \frac{27}{13} \\ \frac{-18}{13}\end{bmatrix} \\ &= \vec r \end{align} So, we can see the definition of projection onto a line generalises to a projection onto any subspace.

What I'm not clear on so far...

We know how to find the projection matrix for our vector onto a subspace by applying the formula to the column vectors of the identity matrix. So we can get our projection matrix.

Above we've said that the projection is the vector from one of the orthogonal complements, say V. So the projection of an arbitrary vector x onto V will be r if V is the rowspace of the transformation matrix.

So the rowspace of the projection matrix, that we know how to find, must therefore contain r.

To find a projection of x onto V:
1. Find the basis vector set for V
2. Find the basis vector set for V^\perp
3. Find the coordinate vector of x relative to the basis vectors of V \cup V^\perp - as we can express any vector like this
4. Then the projection of x onto V is just the combination of the basis components that lie in V

So this must relate projection to change of basis too!! Aaaarg... brain melty

A Projection Onto A Subspace Is A Linear Transform

Say that V is a subspace of \mathbb{R}^n and that V has basis vectors \{\vec{b_1}, \vec{b_2}, \ldots, \vec{b_k}\} . This means that any vector \vec a \in V can be defined as: \vec a = y_1\vec{b_1} + y_2\vec{b_2} + \cdots + y_k\vec{b_k} We know from the definition of matrix multiplication that we can re-write the above as: \vec a = \begin{bmatrix} \uparrow & \uparrow & & \uparrow\\ \vec{b_1} & \vec{b_2} & \cdots & \vec{b_k} \\ \downarrow & \downarrow & & \downarrow\\ \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_k \end{bmatrix} = A\vec y Where A = \underbrace{ \begin{bmatrix} \uparrow & \uparrow & & \uparrow\\ \vec{b_1} & \vec{b_2} & \cdots & \vec{b_k} \\ \downarrow & \downarrow & & \downarrow\\ \end{bmatrix} }_{ \text{cols are basis vecs of subspace } V } Therefore, \forall \vec a \in V , we can say A\vec y = \vec a for some \vec y \in \mathbb{R}^k .

If we have some vector, \vec x \in \mathbb{R}^n , then \mathrm{proj}_V(\vec x) \in V by the definition of a projection. Because, as we have shown above, \forall \vec a \in V,\ \exists \vec y \in \mathbb{R}^k\ |\ A\vec y = \vec a , we can say: \mathrm{proj}_V(\vec x) = A\vec y We also know that we can say: \vec x = \mathrm{proj}_V(\vec x) + \mathrm{proj}_{V^\perp}(\vec x) \implies \vec x - \mathrm{proj}_V(\vec x) = \mathrm{proj}_{V^\perp}(\vec x) Because \mathrm{proj}_{V^\perp}(\vec x) \in V^\perp , by definition of a projection, we must therefore be able to say, because of the equality above, that: \vec x - \mathrm{proj}_V(\vec x) \in V^\perp Now we note that our matrix, A , has its column vectors set as the basis vectors of V . Therefore \mathrm{C}(A) = V . We can take the orthogonal complement of both sides to get \bigl(\mathrm{C}(A)\bigr)^\perp = V^\perp . From here, we can recall that \bigl(\mathrm{C}(A)\bigr)^\perp = \mathrm{N}(A^T) , and that therefore V^\perp = \mathrm{N}(A^T) . Armed with this knowledge, we can substitute this into the above to get: \vec x - \mathrm{proj}_V(\vec x) \in \mathrm{N}(A^T) By the definition of a null space, we can say that: \begin{align} A^T\bigl(\vec x - \mathrm{proj}_V(\vec x)\bigr) &= \vec 0 \\ A^T\vec x - A^T\mathrm{proj}_V(\vec x) &= \vec 0 \end{align} And, we saw a couple of stages back that \mathrm{proj}_V(\vec x) = A\vec y , so we can substitute this in to get: \begin{align} A^T\vec x - A^TA\vec y &= \vec 0 \\ \therefore A^T\vec x &= A^TA\vec y \end{align} Now, if A^TA is invertible then we can solve for \vec y . It was stated (without proof) in the matrix transpose rules list that this is indeed the case because the columns of A , in this case, are LI. Thus: \begin{align} \text{As } A^T\vec x &= A^TA\vec y \\ \therefore (A^TA)^{-1}A^T\vec x &= \vec y \end{align} Because we set \mathrm{proj}_V(\vec x) = A\vec y , multiplying \vec y by A gives us an interesting formula for the projection of \vec x onto V :

\mathbf{\mathrm{proj}_V(\vec x) = A\vec y = A(A^TA)^{-1}A^T\vec x}
This is a linear transform because A(A^TA)^{-1}A^T is just another matrix! So if we have the basis vectors for our subspace we have a formula to project any vector onto it :)
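A numpy sketch of this formula (the basis vectors of V here are made up): the matrix P = A(A^TA)^{-1}A^T projects any vector in \mathbb{R}^3 onto the plane V .

```python
import numpy as np

# Columns of A are a (made-up) basis for a 2D subspace V of R^3.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Projection matrix onto V: P = A (A^T A)^-1 A^T.
P = A @ np.linalg.inv(A.T @ A) @ A.T

x = np.array([1.0, 2.0, 3.0])
p = P @ x

print(p)                               # the part of x that lies in V
print(np.allclose(A.T @ (x - p), 0))   # True: the residual x - p is orthogonal to V
print(np.allclose(P @ p, p))           # True: projecting twice changes nothing
```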

A Projection Is The Closest Vector In The Subspace Projected Onto

We kinda already saw this. The vector \vec r was the shortest solution and this was the projection onto the subspace.

Also, the distance from \vec x to \mathrm{proj}_V(\vec x) is shorter than the distance from \vec x to any other vector in V : \| \vec x - \mathrm{proj}_V(\vec x) \| \le \| \vec x - \vec v\|,\ \forall \vec v \in V

Least Squares

Let's say we have an n by k matrix such that: \underbrace{A}_{n \times k} \underbrace{\vec x}_{\in \mathbb{R}^k} = \underbrace{\vec b}_{\in \mathbb{R}^n} Let's also say that there is NO solution to the above. We know that we can write out A\vec x as A\vec x = x_1\vec{a_1} + x_2\vec{a_2} + \cdots + x_k\vec{a_k} . The fact that there is no solution means that there are no coefficients x_i that can satisfy the equation. I.e., \vec b \notin \mathrm{C}(A) .

What we'd like to do is find a vector \vec{x^*} , which is called our least squares estimate, where A\vec{x^*} is as close as possible to \vec b ! I.e., we want to minimise \mathbf{\| \vec b - A\vec{x^*} \|} . So whilst A\vec x = \vec b may have no solution, \vec{x^*} is the best estimate we can make: the one with the least error, or distance, from \vec b .

Let \vec v = A\vec{x^*} . Then we are going to minimise \| \vec b - \vec v \| . Because \vec v is a solution, it means that \vec v \in \mathrm{C}(A) . So the question is, what vector in A 's column space should we pick? Well, we have seen that the projection of a vector onto a subspace is the closest vector in that subspace to the projectee. So we pick the projection of \vec b onto the column space of A!

TODO INSERT DIAGRAM TO SHOW THIS

So we want the following, as we know it will produce the least distance between \vec v and \vec b : \vec v = A\vec{x^*} = \mathrm{proj}_{C(A)}(\vec b) We have previously seen that there is a formula for the projection matrix. We could just use that, but as Sal continues to explain there is an easier way - calculating the projection matrix involves a bit of work, so let's be lazy :)

Start by subtracting \vec b from both sides of the above equation... \vec v - \vec b = A\vec{x^*} - \vec b = \mathrm{proj}_{C(A)}(\vec b) - \vec b Here \mathrm{proj}_{C(A)}(\vec b) - \vec b is orthogonal to \mathrm{proj}_{C(A)}(\vec b) .

TODO: INSERT DIAGRAM

Thus, A\vec{x^*} - \vec b is equal to something that is orthogonal to the projection, i.e., A\vec{x^*} - \vec b \in \mathrm{C}(A)^\perp . We have also seen that \mathrm{C}(A)^\perp = \mathrm{N}(A^T) , so we can re-write the above as: \underbrace{A\vec{x^*} - \vec b}_{\text{just a vector}} \in \mathrm{N}(A^T) And, by definition of a null space, \mathrm{N}(A^T) will be the set of vectors that A^T sends to the zero vector, i.e., vectors that satisfy A^T\vec d = \vec 0 . So, \begin{align} A^T(A\vec{x^*} - \vec b) = \vec 0 \\ A^TA\vec{x^*} - A^T\vec b = \vec 0 \\ A^TA\vec{x^*} = A^T\vec b \end{align} We've ended up with:

A^TA\vec{x^*} = A^T\vec b
We have derived the above formula from our original problem, so we know it's a valid approach. And using this is much easier than calculating the projection matrix: all we do is take the transpose of A , so we know A^T , A , and \vec b and can therefore solve for \vec{x^*} , our least squares estimate.

Example 1

Let's take a look at Sal's first example, an overdetermined set of linear equations: \begin{align} 2x - y = 2 & \rightarrow y = 2x - 2 \\ x + 2y = 1 & \rightarrow y = -0.5x + 0.5 \\ x + y = 4 & \rightarrow y = -x + 4 \end{align} This means that our coefficient matrix is a 3 by 2 matrix, and we get: \begin{bmatrix} 2 & -1 \\ 1 & 2 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \\ 4 \end{bmatrix} We saw that A^TA\vec{x^*} = A^T\vec b and we know both A and \vec b so, \begin{align} A^TA &= \begin{bmatrix} 2 & 1 & 1 \\ -1 & 2 & 1\end{bmatrix}\begin{bmatrix}2 & -1 \\ 1 & 2 \\ 1 & 1\end{bmatrix} \\ &= \begin{bmatrix} 6 & 1 \\ 1 & 6 \end{bmatrix} \end{align} And... \begin{align} A^T\vec b &= \begin{bmatrix}2 & 1 & 1 \\ -1 & 2 & 1\end{bmatrix}\begin{bmatrix} 2 \\ 1 \\ 4\end{bmatrix}\\ &= \begin{bmatrix} 9 \\ 4\end{bmatrix} \end{align} Thus, by substituting these into our least squares equation we get: \begin{bmatrix} 6 & 1 \\ 1 & 6 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 9 \\ 4 \end{bmatrix} Which we can solve in the normal way: \left[\begin{array}{cc|c} 6 & 1 & 9 \\ 1 & 6 & 4 \end{array}\right] \rightarrow \cdots \rightarrow \left[\begin{array}{cc|c} 1 & 0 & 10/7 \\ 0 & 1 & 3/7 \end{array}\right] Leaving us with our least squares solution to the problem: \therefore \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 10/7 \\ 3/7 \end{bmatrix} = \vec{x^*}
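We can sanity check this numerically; solving the normal equations directly and using numpy's least-squares routine should land on the same answer:

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [1.0,  2.0],
              [1.0,  1.0]])
b = np.array([2.0, 1.0, 4.0])

# Solve the normal equations A^T A x* = A^T b directly...
x_star = np.linalg.solve(A.T @ A, A.T @ b)
print(x_star)   # [1.4285... 0.4285...] = [10/7, 3/7]

# ...and compare with numpy's built-in least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_star, x_lstsq))   # True
```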

Example 2

Quite often we will have sets of data that look like they are linearly related, but the data points don't line up into an exact straight line. However, we can visually imagine fitting a line through our noisy data that would give us the best approximation, or the least error with respect to all the data points.

So to Khan Academy's second example. We have a set of data points \bigl\{(-1, 0), (0,1), (1,2), (2,1)\bigr\} . There is no solution that can put a line through these points, but we can generate a least squares estimate that will produce a line "through" these points that generates the least error: minimises the sum of the square errors from each point to the line of best fit.

TODO insert graph

Once again, our points generate a set of linear equations in the form of y = mx + b : \begin{align} (-1, 0) &\ \rightarrow \ \ \color{red}0 = \color{blue}{-1}m + b\\ (0, 1) &\ \rightarrow \ \ \color{red}1 = \color{blue}0m + b\\ (1, 2) &\ \rightarrow \ \ \color{red}2 = \color{blue}1m + b\\ (2, 1) &\ \rightarrow \ \ \color{red}1 = \color{blue}2m + b\\ \end{align} We will solve for m and b to generate our line of best fit. Thus, A = \begin{bmatrix} \color{blue}{-1} & 1 \\ \color{blue}0 & 1 \\ \color{blue}1 & 1 \\ \color{blue}2 & 1\end{bmatrix}\ \text{ and } \ \vec b = \begin{bmatrix}\color{red}{0 \\ 1 \\ 2 \\ 1}\end{bmatrix} As we saw, A^TA\vec{x^*} = A^T\vec b , so: A^TA = \begin{bmatrix}-1 & 0 & 1 & 2\\1 & 1 & 1& 1\end{bmatrix}\begin{bmatrix} -1 & 1 \\ 0 & 1 \\ 1 & 1 \\ 2 & 1\end{bmatrix} = \begin{bmatrix}6 & 2 \\ 2 & 4\end{bmatrix} And: A^T\vec b = \begin{bmatrix}-1 & 0 & 1 & 2\\1 & 1 & 1& 1\end{bmatrix} \begin{bmatrix}0 \\ 1 \\ 2 \\ 1\end{bmatrix} = \begin{bmatrix}4 \\ 4\end{bmatrix} Now we can solve for \vec{x^*} : \begin{bmatrix}6 & 2 \\ 2 & 4\end{bmatrix}\begin{bmatrix}m \\ b\end{bmatrix} = \begin{bmatrix}4 \\ 4\end{bmatrix} Using the standard augmented matrix method: \left[\begin{array}{cc|c} 6 & 2 & 4 \\ 2 & 4 & 4 \end{array}\right] \rightarrow \cdots \rightarrow \left[\begin{array}{cc|c} 1 & 0 & 2/5 \\ 0 & 1 & 4/5 \end{array}\right] Therefore, m = 2/5 and b = 4/5 :)
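Again, a quick numerical check of the fit (np.polyfit with degree 1 does the same straight-line least-squares fit):

```python
import numpy as np

xs = np.array([-1.0, 0.0, 1.0, 2.0])
ys = np.array([ 0.0, 1.0, 2.0, 1.0])

# Set up A and b exactly as above and solve the normal equations.
A = np.column_stack([xs, np.ones_like(xs)])
m, b = np.linalg.solve(A.T @ A, A.T @ ys)
print(m, b)                   # 0.4 0.8, i.e. m = 2/5 and b = 4/5

# np.polyfit performs the same least-squares straight-line fit.
print(np.polyfit(xs, ys, 1))  # [0.4 0.8]
```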

Projection Onto Line Must Go Through Origin

WOAH, in trying to do one of the graphs I've found out that the projection formula from the previous section only really makes sense if the line passes through the origin, as a line defined by a vector (with no y-intercept) has to pass through the origin.

Also found this SO post, which confirmed my suspicions. It put me onto the idea of affine transforms.

I've realised this is fine because when we're talking about subspaces they must contain the zero vector, because they are closed under scalar multiplication, so when projecting onto a subspace, the line, plane, or whatever, must pass through the origin! It is only when one starts thinking about graphs or graphics that we need to consider the case where the line doesn't pass through the origin, so this is a case that does not bother us here, generally :) It is only a bother when I wanted to do a graphical plot like below:

Here are the projections of the data points, but using the projection formula they project onto the solution vector passing through the origin, not the actual solution line, which has a non-zero y-intercept!!!



So, what we need to do is to translate the real line so that it passes through the origin, apply the same translation to the data points, project them onto the translated line, and then apply the inverse translation to the data points and their projections. Then we get what we want:



The above is not calculated using an affine transform in the way described in this SO post, but does the shifting manually to achieve the same effect.

BUT, if we have to do all of this work, how does the reasoning for LSE work? The answer is because all subspaces contain the zero vector. We're only concerned with direction. The Cartesian coordinates are only a problem because we're trying to plot things here... generally, when projecting onto subspaces, we know they contain the origin, because this is a property of a subspace!

Affine Transforms

http://www.cs.uu.nl/docs/vakken/gr/2011/Slides/05-transformations.pdf

https://www.youtube.com/watch?v=2Snoepcmi9U

Change Of Basis

A Definition

In 2D Cartesian space we are used to specifying coordinates like (2,1) , which means 2 integer units along the x-axis and 1 along the y-axis. These coordinates are, in fact, coefficients of a linear combination of the standard basis vectors \hat i and \hat j , such that we would write (2,1) as 2\hat i + 1\hat j , where, \hat i = \begin{bmatrix}1\\0\end{bmatrix},\ \hat j = \begin{bmatrix}0\\1\end{bmatrix} The coefficients in the linear combination of the space's basis vectors are called the coordinates. I.e., (\color{blue}2, \color{red}1)_\text{std basis} = \color{blue}2\begin{bmatrix} 1 \\ 0\end{bmatrix} + \color{red}1\begin{bmatrix} 0 \\ 1\end{bmatrix}

The standard basis vectors in \mathbb{R}^2 aren't the only possible basis vectors, as we know. We could use the vectors \begin{bmatrix} 1 \\ 1\end{bmatrix} and \begin{bmatrix} 0 \\ 1\end{bmatrix} , in which case all coordinates (x, y) are defined as the coefficients of the linear combination of the 2 basis vectors we just chose: x_\text{new basis}\begin{bmatrix} 1 \\ 1\end{bmatrix} + y_\text{new basis}\begin{bmatrix} 0 \\ 1\end{bmatrix}

What would our coordinates, (2,1)_\text{std basis} , written w.r.t. the standard basis, be if we wrote them w.r.t. our new basis? 2\begin{bmatrix} 1 \\ 0\end{bmatrix} + 1\begin{bmatrix} 0 \\ 1\end{bmatrix} = \begin{bmatrix} 2 \\ 1\end{bmatrix} \downarrow x_\text{new basis}\begin{bmatrix} 1 \\ 1\end{bmatrix} + y_\text{new basis}\begin{bmatrix} 0 \\ 1\end{bmatrix} = \begin{bmatrix} 2 \\ 1\end{bmatrix} With respect to our new basis our x-like and y-like coordinates would be x_\text{new basis} = 2 and y_\text{new basis} = -1 . So w.r.t. our new basis, the coordinates are the coefficients of the basis vectors, so we would write (2, -1) . I.e., (2, 1)_\text{std basis} = (\color{blue}2, \color{red}{-1})_\text{new basis} = \color{blue}2\begin{bmatrix} 1 \\ 1\end{bmatrix} + \color{red}{-1}\begin{bmatrix} 0 \\ 1\end{bmatrix} So... the coefficients in the linear combination of the basis vectors for your subspace represent the coordinates w.r.t. those chosen basis vectors.

We can see this graphically below, where we see the same two vectors plotted with respect to our different bases. You can see that with the non-standard basis, because everything is expressed as combinations of the non standard basis vectors, our graph's grid has become skewed. This is the effect of the change in basis.



Or, to put it another way, but this time with scaled vectors...




We can generalise the above and introduce some new notation to make things clearer:

If V is a subspace of \mathbb{R}^n , and if B is a basis for V , where B = \{\vec{v_1}, \vec{v_2}, \ldots, \vec{v_n}\} , then \vec a \in V \rightarrow \vec a = c_1\vec{v_1} + c_2\vec{v_2} + \cdots + c_n\vec{v_n} Where the coordinates of \vec a are \text{coordinates of } \vec a \text{ w.r.t } B = \left[\vec a\right]_B = \begin{bmatrix}c_1 \\ c_2 \\ \vdots \\ c_n\end{bmatrix} Thus in our above examples, we would have said, if B was our new basis: \left[(2, 1)\right]_B = (2, -1) And \left[(2, -1)\right]_\text{std basis} = (2, 1) When we just write (2, 1) , without the above notation, the standard basis is implied.

Change Of Basis Matrix

We have seen that to construct a transformation matrix we apply the transformation to each column vector of the identity matrix. This means that if B = \{\vec{v_1}, \vec{v_2}, \ldots, \vec{v_k}\} is a basis for \mathbb{R}^k , then the transformation matrix, C , is: C = \begin{bmatrix} \uparrow & \uparrow & & \uparrow \\ \vec{v_1} & \vec{v_2} & \cdots & \vec{v_k} \\ \downarrow & \downarrow & & \downarrow \\ \end{bmatrix} Any vector w.r.t to B would be written as: \left[\vec a\right]_B = \begin{bmatrix}c_1 \\ c_2 \\ \vdots \\ c_k\end{bmatrix} But here's the interesting thing about the notation. If we were to express \vec a as a linear combination of the basis vectors we would write: \vec a = c_1\vec{v_1} + c_2\vec{v_2} + \cdots + c_k\vec{v_k} Notice how we did not write [\vec a]_B . The reason for this is that no matter the basis vectors, the linear combination of the coordinates and basis vectors will always result in a vector in \mathbb{R}^k , which is expressed by the standard basis.

Thus our change of basis matrix does the following:

C\left[\vec{a}\right]_B = \vec a

Because, C\left[\vec{a}\right]_B = \begin{bmatrix} \uparrow & \uparrow & & \uparrow \\ \vec{v_1} & \vec{v_2} & \cdots & \vec{v_k} \\ \downarrow & \downarrow & & \downarrow \\ \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\c_k \end{bmatrix} = c_1\vec{v_1} + c_2\vec{v_2} + \cdots + c_k\vec{v_k} = \vec a

Let's try this using our simple skew example. The change of basis matrix would be: C = \begin{bmatrix} 1 & 1\\ 0 & 1\end{bmatrix} We know that we transformed [0\ \ 1] to [1\ \ 1] , so the vector with coordinates [0\ \ 1]_B should be [1\ \ 1] in the standard basis: \begin{align} C{\begin{bmatrix} 0 \\ 1\end{bmatrix}}_B &= \begin{bmatrix} 1 & 1\\ 0 & 1\end{bmatrix}\begin{bmatrix} 0 \\ 1\end{bmatrix} \\ &= \begin{bmatrix} 1 \\ 1\end{bmatrix} \end{align} Which is exactly what we expected.

In this case, because C is square and, by definition of its construction, its column vectors are L.I., C must be invertible. This means we can do the opposite using:

\left[\vec{a}\right]_B = C^{-1}\vec a

In this case, C^{-1} is trivially found: C^{-1} = \begin{bmatrix} 1 & -1 \\ 0 & 1\end{bmatrix} So, \begin{align} \left[\vec a\right]_B &= C^{-1}\begin{bmatrix}1 \\ 1\end{bmatrix}\\ &=\begin{bmatrix} 1 & -1 \\ 0 & 1\end{bmatrix}\begin{bmatrix}1 \\ 1\end{bmatrix}\\ &=\begin{bmatrix}0 \\ 1\end{bmatrix} \end{align} Which, is again what we expect. Happy days :)

Invertible Change Of Basis Matrix

We have seen that if B is a basis and its vectors are the column vectors of a matrix C , then C is the transformation matrix that does the change of basis - it is the change-of-basis matrix: C\left[\vec a\right]_B = \vec a If C is square, then because its columns are L.I. (because they're a basis), we can say that: C^{-1}C\left[\vec a\right]_B = C^{-1}\vec a And therefore:

\left[\vec a\right]_B = C^{-1}\vec a

Transformation Matrix w.r.t A Basis

Let T be T: \mathbb{R}^n \rightarrow \mathbb{R}^n , such that T(\vec x) = A\vec x . If T transforms \vec x w.r.t. the standard basis to a vector \vec y , then there is a transform that transforms \left[\vec x\right]_B to \left[\vec y\right]_B , where B is a basis for \mathbb{R}^n .

Lets write this other transform as D\left[\vec x\right]_B = \left[\vec y\right]_B . We know \vec y = A \vec x , so we can write: \begin{align}D\left[\vec x\right]_B &= \left[A\vec x\right]_B\\ &= \left[T(\vec x)\right]_B \end{align} Therefore, D is the transformation matrix of T w.r.t the basis B .

Note, at this point, that D is some random transform... maybe we're rotating vectors, scaling them or something. D is not trying to change basis... it is just a transformation that works on vectors expressed w.r.t. a specific basis.

To change basis, we need a change of basis matrix. We saw in the previous section that \left[\vec x\right]_B = C^{-1}\vec x , so we know that we can go from any vector w.r.t. the standard basis to the same vector w.r.t. to any other basis, where the basis is the column vectors of C .

This means that so far we have: D\left[\vec x\right]_B = \left[T(\vec x)\right]_B = \left[A\vec x\right]_B = \left[\vec y\right]_B = C^{-1}\vec y = C^{-1}A\vec x = C^{-1}AC\left[\vec x\right]_B Thus:

D\left[\vec x\right]_B = C^{-1}AC\left[\vec x\right]_B \implies D = C^{-1}AC Where C is our change of basis matrix and D is our transformation matrix w.r.t. the basis B . D is T w.r.t. B !!

Sal presents this graph which is an excellent summary: \begin{array}{ccc} \large\mathbf{\vec x} & \rightarrow & \small{A=CDC^{-1}}& \rightarrow & \large\mathbf{T\left(\vec x\right)} \\ \downarrow & & & & \downarrow \\ \small{C^{-1}} & & & & \small{C^{-1}}\\ \downarrow & & & & \downarrow \\ \large\mathbf{\left[\vec x\right]_B} & \rightarrow & \small{D=C^{-1}AC} & \rightarrow & \large\mathbf{\left[T\left(\vec x\right)\right]_B} \end{array}
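A quick numpy check of this diagram ( A , C , and the test vector are arbitrary):

```python
import numpy as np

A = np.array([[2.0, 1.0],    # some transform w.r.t. the standard basis
              [0.0, 3.0]])
C = np.array([[1.0, 1.0],    # columns are the basis vectors of B
              [0.0, 1.0]])
C_inv = np.linalg.inv(C)

D = C_inv @ A @ C            # the same transform, expressed w.r.t. B

x = np.array([2.0, -1.0])
x_B = C_inv @ x              # [x]_B

# Transforming in B-coordinates with D agrees with transforming with A
# in standard coordinates and then converting the result to B-coordinates.
print(np.allclose(D @ x_B, C_inv @ (A @ x)))   # True
```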

Changing Coordinate Systems To Help Find A Transform Matrix

Sal gives the example of finding the transform for a reflection in a line, L . Normally we would apply the transform to the identity matrix to get the transformation matrix, A : A = \begin{bmatrix} \mathrm{T}(\vec e_1) & \mathrm{T}(\vec e_2) \end{bmatrix} We could try to figure out \mathrm{T}(\vec e_i) using trig, but that is fiddly in 2D, scales badly into 3D, and becomes very difficult in higher dimensions!

But, if we use \vec b_1 \in L as our first basis vector and \vec b_2 \in L^\perp as our second basis vector and express our space with respect to this basis, then the problem becomes taking a reflection in the horizontal axis, which is by far a simpler problem to solve!

If we wanted to reflect \vec y \in \mathbb{R}^2 , then we reflect \left[\vec y\right]_B in the horizontal axis. To do this all we have to do is negate the y-value in the vector \left[\vec y\right]_B . Nice!

So, \vec v = c_1\vec{b_1} + c_2\vec{b_2} \implies \left[\vec v\right]_B = \begin{bmatrix}c_1\\c_2\end{bmatrix} The transform is thus: \begin{bmatrix}c_1\\c_2\end{bmatrix} \rightarrow \begin{bmatrix}c_1\\-c_2\end{bmatrix} I.e., D\begin{bmatrix}c_1\\c_2\end{bmatrix} = \begin{bmatrix}c_1\\-c_2\end{bmatrix} \therefore \begin{bmatrix}d_1 & d_2 \\ d_3 & d_4\end{bmatrix}\begin{bmatrix}c_1\\c_2\end{bmatrix} = \begin{bmatrix}c_1\\-c_2\end{bmatrix} From this we get: D = \begin{bmatrix}1 & 0 \\ 0 & -1\end{bmatrix} Now we can get back from D , which is the reflection w.r.t. the basis B , to the reflection w.r.t. the standard basis, because we know that A = CDC^{-1} , so happily we can now calculate the original transform!
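A small numpy sketch of this trick (the particular line L , spanned by the hypothetical vector [2, 1] , is made up for illustration):

```python
import numpy as np

# Hypothetical line L spanned by b1; b2 spans the orthogonal complement of L.
b1 = np.array([2.0, 1.0])
b2 = np.array([-1.0, 2.0])
C = np.column_stack([b1, b2])   # change of basis matrix

D = np.array([[1.0,  0.0],      # reflection in the "horizontal" axis, w.r.t. B
              [0.0, -1.0]])

A = C @ D @ np.linalg.inv(C)    # the reflection in L, w.r.t. the standard basis
print(A @ b1)                   # [2. 1.]   -- vectors on L are unchanged
print(A @ b2)                   # [ 1. -2.] -- vectors perpendicular to L are flipped
```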

Orthonormal Basis

If all the vectors in a basis B = \{\vec{v_1}, \vec{v_2}, \ldots, \vec{v_k}\} are unit length and are orthogonal to each other, then we call this basis orthonormal. Because all the vectors are orthogonal to each other, they are also linearly independent.

Orthonormal bases make good coordinate systems, but as Sal asks, "What does good mean?". If B is an orthonormal basis for a subspace V , and \vec x \in V , then we can say, by definition: \vec x = c_1\vec{v_1} + c_2\vec{v_2} + \cdots + c_i\vec{v_i} + \cdots + c_k\vec{v_k} \therefore \vec x \cdot \vec{v_i} = \underbrace{c_1\vec{v_1}\cdot \vec{v_i} + c_2\vec{v_2}\cdot \vec{v_i} + \cdots}_\text{must be zero} + \underbrace{c_i\vec{v_i}\cdot \vec{v_i}}_\text{must be 1} + \underbrace{\cdots + c_k\vec{v_k}\cdot \vec{v_i}}_\text{must be zero} We know that c_i\vec{v_i}\cdot \vec{v_i} is the only term that can be non-zero because, by definition, all the other vectors are perpendicular to \vec{v_i} , so their dot products with \vec{v_i} must be zero. Therefore we can say that: \vec{v_i} \cdot \vec x = c_i And why is this good?

Well, C\left[\vec x\right]_B = \vec x and \left[\vec x\right]_B = C^{-1}\vec x may be hard to find, especially if C is not square (and therefore not invertible), in which case we would have to solve a system of equations for \left[\vec x\right]_B . The dot product method above, which only holds for an orthonormal basis, lets us just write: \left[\vec x\right]_B = \begin{bmatrix}c_1\\c_2\\\vdots\\c_k\end{bmatrix} = \begin{bmatrix}\vec{v_1}\cdot \vec x \\ \vec{v_2}\cdot \vec x \\ \vdots \\ \vec{v_k}\cdot \vec x\end{bmatrix} This is MUCH EASIER! This is why orthonormal bases are "good": it is easy to figure out coordinates!
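For example, a minimal numpy sketch with a made-up orthonormal basis for a plane in \mathbb{R}^3 :

```python
import numpy as np

# A made-up orthonormal basis for a plane (a 2D subspace) of R^3.
v1 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
v2 = np.array([0.0, 0.0, 1.0])

x = 3.0 * v1 - 2.0 * v2         # some vector that lives in the subspace

# The coordinates w.r.t. the orthonormal basis are just dot products - no solving needed.
print(v1 @ x, v2 @ x)           # 3.0 -2.0 (up to floating point error)
```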

Projection Onto Subspace With Orthonormal Basis

We had previously seen that a projection onto any subspace can be written as: \mathbf{\mathrm{proj}_V(\vec x) = A(A^TA)^{-1}A^T\vec x} Where the column vectors of A are the basis vectors for the subspace V that we are projecting onto. However, when the basis is orthonormal, A^TA = I , so this reduces to:

\mathrm{proj}_V(\vec x) = AA^T\vec x

Much nicer :) Another reason why orthonormal bases are "good".
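A quick numerical check that the two formulas agree when the columns of A are orthonormal (a numpy sketch with a made-up basis):

```python
import numpy as np

# Columns of A form a made-up orthonormal basis for a plane in R^3.
A = np.column_stack([
    np.array([1.0, 1.0, 0.0]) / np.sqrt(2),
    np.array([0.0, 0.0, 1.0]),
])

x = np.array([1.0, 2.0, 3.0])

proj_general = A @ np.linalg.inv(A.T @ A) @ A.T @ x   # general projection formula
proj_ortho   = A @ A.T @ x                            # shortcut for an orthonormal basis
print(np.allclose(proj_general, proj_ortho))          # True
```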

Orthogonal Matrices Preserve Angles And Lengths

An orthogonal matrix is one whose columns and rows are orthonormal vectors, and which is square. For such a matrix C : C^TC = CC^T = I This also means that: C^T = C^{-1} This makes finding the inverse of the matrix much, much easier!

Such matrices also preserve angles and lengths when they are used as transformation matrices. Thus, if we transformed two different vectors, the angle between them would stay the same and their lengths would too. In other words, under this transform: \|\vec x\| = \|C\vec x\| We can see that such a transform preserves lengths: \begin{align} \|C\vec x\|^2 &= C\vec x \cdot C\vec x \\ &= (C\vec x)^TC\vec x \\ &=\vec{x}^TC^TC\vec x \\ &= \vec{x}^T\underbrace{C^TC}_{=I}\vec x \\ &= \vec {x}^T\vec x \\ &= \|\vec x\|^2 \end{align} We can, in a similar manner, see that such a transform preserves angles: \begin{align} \cos \theta_C &= \frac{C\vec v \cdot C\vec w}{\|C\vec v\|\|C\vec w\|} \\ &= \frac{C\vec v \cdot C\vec w}{\|\vec v\|\|\vec w\|} \\ &= \frac{\left(C\vec v\right)^T C\vec w}{\|\vec v\|\|\vec w\|} \\ &= \frac{\vec{v}^TC^TC\vec w}{\|\vec v\|\|\vec w\|} \\ &= \frac{\vec v \cdot \vec w}{\|\vec v\|\|\vec w\|} \\ &= \cos \theta \end{align}
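A quick numerical check (a numpy sketch; the rotation matrix is just one convenient example of an orthogonal matrix):

```python
import numpy as np

theta = 0.3
C = np.array([[np.cos(theta), -np.sin(theta)],   # a rotation matrix is orthogonal
              [np.sin(theta),  np.cos(theta)]])

v = np.array([1.0, 2.0])
w = np.array([-3.0, 0.5])

print(np.allclose(C.T @ C, np.eye(2)))           # True: C^T C = I
print(np.linalg.norm(C @ v), np.linalg.norm(v))  # lengths agree
print((C @ v) @ (C @ w), v @ w)                  # dot products (hence angles) agree
```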

Make Any Basis Orthonormal: The Gram-Schmidt Process

Any basis can be transformed into an orthonormal basis for the same span using the Gram-Schmidt process. It is a recursive method.

If B is a basis \{\vec {v_1},\vec {v_2},\ldots, \vec {v_k}\} for V , the vectors must be L.I., but not necessarily perpendicular or of unit length. To make them so, start by creating a subspace V_1 : V_1 = \mathrm{span}\left(\hat{u_1}\right),\text{ where } \hat{u_1} = \frac{\vec{v_1}}{\|\vec{v_1}\|} Next create a subspace, V_2 , as follows: \begin{align} V_2 &= \mathrm{span}\left(\vec{v_1}, \vec{v_2}\right) \\ &= \mathrm{span}\left(\hat{u_1}, \vec{v_2}\right) \end{align} We know that we can re-write \vec{v_2} as the sum of a vector from a subspace and a vector from its orthogonal complement: \vec{v_2} = \vec{x_2} + \vec{y_2},\text{ where } \vec{x_2} \in V_1 \text{ and } \vec{y_2} \in V_1^\perp We also know \vec{x_2} = \mathrm{proj}_{V_1}\left(\vec{v_2}\right) And \vec{y_2} = \vec{v_2} - \mathrm{proj}_{V_1}\left(\vec{v_2}\right) Because V_1 has an orthonormal basis, we know that \mathrm{proj}_{V_1}\left(\vec{v_2}\right) = (\vec{v_2}\cdot\hat{u_1})\hat{u_1} So, \vec{y_2} = \vec{v_2} - (\vec{v_2}\cdot\hat{u_1})\hat{u_1} We can now normalise \vec{y_2} to become \hat{u_2} in the same way as before: \hat{u_2} = \frac{\vec{y_2}}{\|\vec{y_2}\|} And now we can re-write V_2 using the orthonormal basis: V_2 = \mathrm{span}\left(\hat{u_1}, \hat{u_2}\right) We keep repeating this process, each time subtracting from the next vector its projection onto the span of the vectors already processed and normalising the result, until every vector in the basis has been replaced.
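A minimal sketch of the process in numpy (the function name gram_schmidt and the example vectors are my own; this is the classical Gram-Schmidt described above, not a numerically-hardened version):

```python
import numpy as np

def gram_schmidt(vectors):
    """Turn a list of L.I. vectors into an orthonormal basis for the same span."""
    basis = []
    for v in vectors:
        y = np.array(v, dtype=float)
        for u in basis:
            y -= (v @ u) * u                  # subtract the projection of v onto span(u)
        basis.append(y / np.linalg.norm(y))   # normalise what is left over
    return basis

u1, u2 = gram_schmidt([np.array([1.0, 1.0]), np.array([0.0, 1.0])])
print(u1 @ u2)                                # ~0: orthogonal
print(np.linalg.norm(u1), np.linalg.norm(u2)) # 1.0 1.0: unit length
```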

Eigen Vectors & Eigen Values

If T\left(\vec v\right) = A\vec v = \lambda\vec v and \vec v \ne \vec 0 , then \vec v is an eigen vector and \lambda is an eigen value.

We can note from the above definition that A only scales the vector \vec v , so the direction of \vec v (the line it spans) is preserved. This is a little reminiscent of the section on orthogonal matrices preserving angles and lengths, except in this case lengths are not maintained.

If A\vec v = \lambda \vec v and a solution exists with \vec v \ne \vec 0 , then eigen vectors and values exist. A\vec v = \lambda \vec v \therefore A\vec v = \lambda I_n \vec v \therefore (\lambda I_n - A)\vec v = \vec 0 \text{, and } \vec v \ne \vec 0 \implies \vec v \in \mathrm{N}\left(\lambda I_n - A\right) We know that \vec v \ne \vec 0 , so \mathrm{N}(\lambda I_n - A) is non-trivial. We saw in the section on null spaces that the column vectors of a matrix are L.I. if and only if its null space contains only the zero vector.

This means that the column vectors of \lambda I_n - A must not be L.I., which means that the matrix is not invertible, which in turn means that there must be some \lambda for which \mathrm{det}(\lambda I_n - A) = 0 (see the section on determinants - a matrix is invertible iff |A| \ne 0 ).

A\vec v = \lambda \vec v has solutions for non-zero \vec v 's iff \mathrm{det}(\lambda I_n - A) = 0 . By using this we can solve for \lambda .

Properties

  • Eigen vectors can only be found for square matrices.
  • Not every square matrix has (real) eigen vectors, and an n \times n matrix has at most n L.I. eigen vectors - it need not have a full set of n.
  • When you find an eigen vector it is normally a good idea to normalise it.

Example 1

An example... A = \begin{bmatrix} 1 & 2 \\ 4 & 3 \end{bmatrix} \lambda is an eigen value of A . \mathrm{det}\left(\lambda \begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix} - \begin{bmatrix}1 & 2 \\ 4 & 3\end{bmatrix} \right) = 0 \therefore \mathrm{det}\left(\begin{bmatrix}\lambda - 1 & -2 \\ -4 & \lambda - 3\end{bmatrix}\right) = 0 \therefore (\lambda - 1)(\lambda -3) - 8 = 0 \therefore \lambda^2 - 4\lambda - 5 = 0 \therefore (\lambda - 5)(\lambda + 1) = 0 \therefore \lambda = 5 \text { or } \lambda = -1 We saw above how (\lambda I_n - A)\vec v = \vec 0 \implies \vec v \in \mathrm{N}(\lambda I_n - A) . Thus for any eigen value, the eigen vectors that correspond to it are the null space \mathrm{N}(\lambda I_n - A) . This is called the eigen space - denoted E_\lambda .

Thus for \lambda = 5 : \begin{align} E_\lambda &= \mathrm{N}(\lambda I_n - A) \\ &= \mathrm{N}\left(\begin{bmatrix}5 & 0\\ 0 & 5\end{bmatrix} - \begin{bmatrix}1 & 2\\ 4 & 3\end{bmatrix}\right) \\ &= \mathrm{N}\left(\begin{bmatrix}4 & -2\\ -4 & 2\end{bmatrix}\right) \\ \end{align} From here we can say, therefore: \begin{align} \begin{bmatrix}4 & -2\\ -4 & 2\end{bmatrix} \vec v = \begin{bmatrix}4 & -2\\ -4 & 2\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = \vec 0 \end{align} Solve using an augmented matrix: \begin{align} \left[\begin{array}{cc|c}4 & -2 & 0\\ -4 & 2 & 0\end{array}\right] \\ \downarrow \tiny{(r2 = r2 + r1)}\\ \left[\begin{array}{cc|c}4 & -2 & 0\\ 0 & 0 & 0\end{array}\right] \\ \downarrow \tiny{(r1 = 0.25 r1)}\\ \left[\begin{array}{cc|c}1 & -0.5 & 0\\ 0 & 0 & 0\end{array}\right] \\ \end{align} Therefore, v_1 = 0.5v_2 \begin{align} \implies E_{\lambda = 5} &= \left\{t\begin{bmatrix}0.5\\1\end{bmatrix}, t \in \mathbb{R}\right\}\\ &= \mathrm{span}\left(\begin{bmatrix}0.5\\1\end{bmatrix}\right) \end{align} And in the same way we can find that \begin{align} E_{\lambda = -1} &= \mathrm{span}\left(\begin{bmatrix}-1\\1\end{bmatrix}\right) \end{align} We can plot these:


(Plot of the eigen spaces E_{\lambda = 5} and E_{\lambda = -1} - both are lines through the origin along their spanning vectors.)
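We can also check the hand calculation with numpy (a minimal sketch, using the A from this example; note numpy may return the eigen values in a different order and with normalised eigen vectors):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [4.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # [ 5. -1.] (order may differ)
print(eigenvectors)   # columns are unit eigen vectors, proportional to [0.5, 1] and [-1, 1]
```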

Example 2

Sal presents a second example... A = \begin{bmatrix}-1 & 2 & 2 \\ 2 & 2 & -1 \\ 2 & -1 & 2\end{bmatrix} Remember, the e.v. equation is \mathrm{det}(\lambda I_n - A) = 0 , so in this case we have: |\lambda I_3 - A| = \begin{vmatrix}\lambda + 1 & -2 & -2 \\ -2 & \lambda - 2 & 1 \\ -2 & 1 & \lambda - 2 \end{vmatrix} = 0 We can use the rule of Sarrus to solve this: \begin{align} |\lambda I_3 - A| &= (\lambda + 1)(\lambda - 2)(\lambda - 2) + 4 + 4 \\ &\ \ \ - (\lambda + 1) - (4(\lambda - 2)) - (4(\lambda - 2)) \\ & = 0 \end{align} Simplifying this a little we get: \begin{align} |\lambda I_3 - A| &= \lambda^3 - 3\lambda^2 - 9\lambda + 27 \\ &= (\lambda - 3)(\lambda^2 - 9) \\ &= (\lambda - 3)(\lambda + 3)(\lambda - 3) \end{align} This means that the roots are either \lambda = 3 or \lambda = -3

As before, we know that E_\lambda = \mathrm{N}(\lambda I_n - A) .

For \lambda = 3 : E_\lambda = \mathrm{N}(\lambda I_n - A) \implies \begin{bmatrix}4 & -2 & -2\\ -2 & 1 & 1 \\ -2 & 1 & 1\end{bmatrix} \vec v = \vec 0 Solve using an augmented matrix: \left[\begin{array}{ccc|c}4 & -2 & -2 & 0\\ -2 & 1 & 1 & 0 \\ -2 & 1 & 1 & 0\end{array}\right] \downarrow \left[\begin{array}{ccc|c}1 & -1/2 & -1/2 & 0\\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0\end{array}\right] Thus: v_1 = 0.5v_2 + 0.5v_3 From which we can say: E_{\lambda=+3} = \mathrm{span}\left(\begin{bmatrix}0.5 \\ 1 \\ 0\end{bmatrix}, \begin{bmatrix}0.5 \\ 0 \\ 1\end{bmatrix}\right) We can do the same for E_{\lambda=-3} , where we would get the result E_{\lambda=-3} = \mathrm{span}\left(\begin{bmatrix}-2 \\ 1 \\ 1\end{bmatrix}\right) We can note that E_{\lambda=3} and E_{\lambda=-3} are perpendicular (which happens here because A is symmetric). However, this need not be the case in general [Ref].
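Again, a quick numpy check of the hand calculation (a minimal sketch):

```python
import numpy as np

A = np.array([[-1.0,  2.0,  2.0],
              [ 2.0,  2.0, -1.0],
              [ 2.0, -1.0,  2.0]])

eigenvalues, _ = np.linalg.eig(A)
print(np.round(eigenvalues, 6))    # 3 (twice) and -3

# The eigen spaces found above really are perpendicular:
e3a = np.array([0.5, 1.0, 0.0])
e3b = np.array([0.5, 0.0, 1.0])
em3 = np.array([-2.0, 1.0, 1.0])
print(e3a @ em3, e3b @ em3)        # 0.0 0.0
```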

Showing That Eigen Bases Are Good Coordinate Systems

It can be shown that if a transformation matrix A has n L.I. eigen vectors then we can form a basis B = \{\vec {v_1}, \ldots, \vec{v_n}\} where A\vec{v_i} = \lambda_i\vec{v_i} , such that the transformation in this new basis representation can be written as: D = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} Recall, in the section on transformation matrices w.r.t. a basis, Sal Khan's summary graph: \begin{array}{ccc} \large\mathbf{\vec x} & \rightarrow & \small{A=CDC^{-1}}& \rightarrow & \large\mathbf{T\left(\vec x\right)} \\ \downarrow & & & & \downarrow \\ \small{C^{-1}} & & & & \small{C^{-1}}\\ \downarrow & & & & \downarrow \\ \large\mathbf{\left[\vec x\right]_B} & \rightarrow & \small{D=C^{-1}AC} & \rightarrow & \large\mathbf{\left[T\left(\vec x\right)\right]_B} \end{array} We can show this as follows: TODO
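A minimal numerical illustration of this claim, reusing the eigen vectors from Example 2 as the columns of C (a numpy sketch, not part of Sal's lectures):

```python
import numpy as np

A = np.array([[-1.0,  2.0,  2.0],
              [ 2.0,  2.0, -1.0],
              [ 2.0, -1.0,  2.0]])

# Columns of C are three L.I. eigen vectors of A (from Example 2): an eigen basis.
C = np.column_stack([
    np.array([ 0.5, 1.0, 0.0]),    # eigen value  3
    np.array([ 0.5, 0.0, 1.0]),    # eigen value  3
    np.array([-2.0, 1.0, 1.0]),    # eigen value -3
])

D = np.linalg.inv(C) @ A @ C
print(np.round(D, 6))              # diag(3, 3, -3): w.r.t. the eigen basis the transform is diagonal
```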

PCA

Really liked these YouTube tutorials by Data4Bio:

  1. Dimensionality Reduction: High Dimensional Data, Part 1
  2. Dimensionality Reduction: High Dimensional Data, Part 2
  3. Dimensionality Reduction: High Dimensional Data, Part 3
  4. Dimensionality Reduction: Principal Components Analysis, Part 1
  5. Dimensionality Reduction: Principal Components Analysis, Part 2
  6. Dimensionality Reduction: Principal Components Analysis, Part 3

These are notes based on these lectures... TODO