Linear Algebra Notes

Notes on some linear algebra topics. Sometimes I find things are just stated as being obvious, for example, that the dot product of two orthogonal vectors is zero. Well... why is this? The result is pretty simple, but nonetheless it took a little while to think it through properly. Hence, these notes... they're my musings on the "why", even if the "why" might be obvious to the rest of the world! They have also degraded into plain notes, without the musings :'(...

A lot of these notes are now just based on the amazing Khan Academy linear algebra tutorials.


Vector Stuff

Vector Spaces


A vector is an element of a vector space. A vector space, also known as a linear space, V , is a set which has addition and scalar multiplication defined for it and satisfies the following for any vectors \vec u , \vec v , and \vec w and scalars c and d :

1 Closed under addition: \vec u + \vec v \in V
2 Commutative: \vec u + \vec v = \vec v + \vec u
3 Associative: (\vec u + \vec v) + \vec w = \vec u + (\vec v + \vec w)
4 Additive identity: \vec 0 \in V, \text{i.e.,} \vec u + \vec 0 = \vec u
5 Inverse: \forall \vec u \in V, \exists -\vec u | \vec u + (-\vec u) = \vec 0
6 Closed under scalar multiply: c\vec u \in V
7 Distributive: c(\vec u + \vec v) = c\vec u + c\vec v
8 Distributive: (c + d)\vec u = c\vec u + d\vec u
9 Associative (scalar multiplication): c(d\vec u) = (cd)\vec u
10 Multiply identity: 1 \vec u = \vec u

Function Spaces

First let's properly define a function. If X and Y are sets and we have x \in X and y \in Y , we can form a set F that consists of ordered pairs (x_i, y_j) . If it is the case that (x_1, y_1) \in F \land (x_1, y_2) \in F \implies y_1 = y_2 , then F is a function and we write y = F(x) . This is called being "single valued", i.e., each input to the function maps to an unambiguous output. Note that F(x) is not the function, it is the value returned by the function: F is the function that defines the mapping from elements in X (the domain) to the elements in Y (the range).

One example of a vector space is the set of real-valued functions of a real variable, defined on the domain [a \le x \le b] . I.e., we're talking about \{f_1, f_2, \ldots, f_n\} where each f_i is a function. In other words, f_i : \mathbb{R} \rightarrow \mathbb{R} with the aforementioned restriction on the domain.

We've said that a vector is any element in a vector space, but usually we think of this in terms of, say, [x, y] \in \mathbb{R}^2 . Given the above, however, we can also think of functions as vectors: the function-as-a-vector being the "vector" of ordered pairs that make up the mapping between domain and range. So in this case the set of functions is still like a set of vectors, and with scalar multiplication and vector addition defined appropriately the rules for a vector space still hold!

What Isn't A Vector Space

It sounded a lot like everything was a vector space to me! So I did a quick bit of googling and found the paper "Things that are/aren't vector spaces". It does a really good job of showing how in maths we might build up a model of something and expect it to behave "normally" (i.e., obey the vector space axioms), when in fact it doesn't, in which case it becomes a lot harder to reason about. This is why vector spaces are nice... we can reason about them using logic and operations that we are very familiar with and that feel intuitive.

Linear Combinations

A linear combination of vectors is one that only uses the addition and scalar multiplication of those vectors. So, given two vectors, \vec a and \vec b the following are linear combinations: 0\vec a + 0\vec b \\ 5\vec a - 2.5\vec b \\ -123\vec a + \pi\vec b This can be generalised to say that for any set of vectors \{\vec{v_1}, \vec{v_2}, \ldots, \vec{v_n}\} \text{ for } \vec{v_1}, \vec{v_2}, \ldots, \vec{v_n} \in \mathbb{R}^m that a linear combination of those vectors is c_1\vec{v_1} + c_2\vec{v_2} + \cdots + c_n\vec{v_n} \text{ for } c_i \in \mathbb{R} .

Spans

Graph showing the span of two vectors in 2D space

The set of all linear combinations of a set of vectors \{\vec{v_1}, \vec{v_2}, \ldots, \vec{v_n}\} is called the span of those vectors: \mathrm{span}(\vec{v_1}, \vec{v_2}, \ldots, \vec{v_n}) = \{c_1\vec{v_1} + c_2\vec{v_2} + \cdots + c_n\vec{v_n} \;|\; c_i \in \mathbb{R}\} In the graph to the right there are two vectors \vec a and \vec b . The dotted blue lines show all the possible scalings of the vectors. The vectors span those lines. The green lines show two particular scalings of \vec a added to all the possible scalings of \vec b . One can use this to imagine that if we added all the possible scalings of \vec a to all the possible scalings of \vec b we would reach every coordinate in \mathbb{R}^2 . Thus, we can say that the set of vectors \{\vec a, \vec b\} spans \mathbb{R}^2 , or \mathrm{span}(\vec a, \vec b) = \mathbb{R}^2 .

From this we may be tempted to say that any set of two 2D vectors spans \mathbb{R}^2 , but we would be wrong. Take for example \vec a and \vec b = -\alpha\vec a . These two vectors are co-linear. Because they span the same line, no combination of them can ever move off that line. Therefore they do not span \mathbb{R}^2 .

Let's write the above out a little more thoroughly... \vec a = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} , \vec b = \begin{bmatrix} -\alpha a_1 \\ -\alpha a_2 \end{bmatrix} The span of these vectors is, by our above definition: \mathrm{span}(\vec a, \vec b) = \left\{ c_1 \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} + c_2 \begin{bmatrix} -\alpha a_1 \\ -\alpha a_2 \end{bmatrix} \mid c_i \in \mathbb{R} \right\} Which we can re-write as: \begin{align} \mathrm{span}(\vec a, \vec b) &= \left\{ c_1 \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} -\alpha c_2 \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \mid c_i \in \mathbb{R} \right\} \\ &= \left\{ (c_1 -\alpha c_2) \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \mid c_i \in \mathbb{R} \right\} \\ \end{align} ... which we can clearly see is just the set of all scaled vectors of \vec a . Plainly, any co-linear set of 2D vectors will not span \mathbb{R}^2 .

Linear Independence

We saw above that the set of vectors \{\vec a, -\alpha\vec a\} are co-linear and thus do not span \mathbb{R}^2 . The reason for this is that they are linearly dependent, which means that one or more vectors in the set are just linear combinations of other vectors in the set.

For example, the set of vectors \{[1, 0], [0, 1], [2, 3]\} spans \mathbb{R}^2 , but [2, 3] is redundant because we can remove it from the set of vectors and still have a set that spans \mathbb{R}^2 . In other words, it adds no new information or directionality to our set.

A set of vectors is said to be linearly dependent when: c_1\vec{v_1} + \cdots + c_n\vec{v_n} = \vec 0 \;\text{ for some } c_1, \ldots, c_n \text{ that are not all zero} Thus a set of vectors is linearly independent when the following is met: c_1\vec{v_1} + \cdots + c_n\vec{v_n} = \vec 0 \implies c_1 = c_2 = \cdots = c_n = 0 Sure, this is a definition, but why does this capture linear dependence?

Picture of linearly independent vectors

Think of vectors in \mathbb{R}^2 . For any two vectors that are not co-linear, there is no combination of those two vectors that can get you back to the origin unless they are both scaled by zero. The image to the left is trying to demonstrate that. If we scale \vec a , no matter how small we make that scaling, we cannot find a scaling of \vec b that when added to \vec a will get us back to the origin. Thus we cannot find a linear combination c_a\vec a + c_b\vec b = \vec 0 where c_a and/or c_b does not equal zero.

This shows that neither vector is a linear combination of the other, because if it were, we could use a non-zero scaling of one, added to the other, to get us back to the origin.

We can also show that two non-zero vectors are linearly independent when their dot product is zero. This is covered in a later section.
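As a programmatic aside, we can test for linear dependence by looking for a non-zero set of coefficients whose combination gives the zero vector. This is just a sketch using sympy (the same package used later in these notes), applied to the example set from above:

from sympy import Matrix, symbols, solve

c1, c2, c3 = symbols('c1 c2 c3')
v1, v2, v3 = Matrix([1, 0]), Matrix([0, 1]), Matrix([2, 3])

# Solve c1*v1 + c2*v2 + c3*v3 = 0, one equation per vector component.
# A solution with some c_i != 0 means the set is linearly dependent;
# if the only solution is all zeros, the set is linearly independent.
eqs = list(c1*v1 + c2*v2 + c3*v3)
print(solve(eqs, [c1, c2, c3], dict=True))
# e.g. [{c1: -2*c3, c2: -3*c3}] -- c3 is free, so non-zero solutions exist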

Linear Subspace

If V is a set of vectors in \mathbb{R}^n , then V is a subspace of \mathbb{R}^n if and only if:

  1. V contains the zero vector: \vec 0 \in V
  2. V is closed under scalar multiplication: \vec x \in V, \alpha \in \mathbb{R} \implies \alpha\vec x \in V
  3. V is closed under addition: \vec a \in V, \vec b \in V \implies \vec a + \vec b \in V

What these rules mean is that, using linear operations, you can't do anything to "break" out of the subspace using vectors from that subspace. Hence "closed".

Surprisingly, at least for me, the set containing only the zero vector, \{\vec 0\} , is actually a subspace of \mathbb{R}^n because it obeys all of the above rules.

A very useful thing is that a span is always a valid subspace. Take, for example, V = \mathrm{span}(v_1, v_2, v_3) . We can show that it is a valid subspace by checking that it obeys the above 3 rules.

First, does it contain the zero vector? Recall that a span is the set of all linear combinations of the span vectors. So we know that the following linear combination is a member of the span, and hence the span contains the zero vector: 0\vec v_1 + 0\vec v_2 + 0\vec v_3 = \vec 0 Second, is it closed under scalar multiplication? Take any member of the span, \vec x = a_1\vec v_1 + a_2\vec v_2 + a_3\vec v_3 , and any scalar c . Then c\vec x = (ca_1)\vec v_1 + (ca_2)\vec v_2 + (ca_3)\vec v_3 , which is just another linear combination of the span vectors and so, by definition of the span, is also in the span.

Third, is it closed under vector addition? Well, yes: adding two linear combinations of \vec v_1, \vec v_2, \vec v_3 just gives another linear combination, so the same argument applies.

Interestingly, using the same method as above, we can see that even something trivial, like the \mathrm{span}([a_1, a_2]) , which is just a line, is a subspace of \mathbb{R}^2 .

Basis Of A Subspace

Recall that V = \mathrm{span}(\vec v_1, \ldots, \vec v_n) is a subspace of \mathbb{R}^n (where each vector \vec v_i \in \mathbb{R}^n ). We also know that a span is the set of all possible linear combinations of these vectors.

If we choose a subset of V , S , where all the vectors in S are linearly independent, and we can write any vector in V as a linear combination of vectors in S , then we say that S is a basis for V .

S acts as a "minimal" set of vectors required to span V . I.e., if you added any other vector to S it is superfluous, because it would be a linear combination of the vectors already in S . You don't add any new information.

A subspace can have an infinite number of bases, but there is a standard basis. For example, in \mathbb{R}^2 the standard basis is \{[1, 0], [0, 1]\} .

A basis is a good thing because we can break down any element in the space that the basis defines into a linear combination of basic building blocks, i.e. the basis vectors.

TODO discuss the canonical vectors in R2 and R3. [1,0] , [0,1] ... but we could also have [1,0] , [1,1] .

TODO unique coefficients. implied by the linear independence

TODO most important bases are orthogonal and orthonormal.

Orthonormal bases are good because we can find the coefficients in the new basis easily: \alpha_k = \vec w_k \cdot \vec x
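To make that concrete, here's a minimal sketch (using numpy, which these notes don't otherwise use, with an orthonormal basis made up for illustration): the coefficients of a vector in an orthonormal basis are just dot products with the basis vectors.

import numpy as np

# An orthonormal basis for R^2: the canonical basis rotated by 45 degrees.
w1 = np.array([1.0,  1.0]) / np.sqrt(2)
w2 = np.array([-1.0, 1.0]) / np.sqrt(2)

x = np.array([3.0, 1.0])

# alpha_k = w_k . x gives the coefficient of x along each basis vector.
alpha1, alpha2 = np.dot(w1, x), np.dot(w2, x)
print(np.allclose(alpha1 * w1 + alpha2 * w2, x))   # True -- x is recovered exactly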

The Dot Product

This is a good reference.

The dot product is one definition for a vector multiplication that results in a scalar. Given two equal-length vectors, the dot product looks like this: \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} \cdot \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} = a_1b_1 + a_2b_2 + \cdots + a_nb_n Written more succinctly we have: x \cdot y = \sum_{i=1}^n {x_i y_i} The dot product is important because it will keep popping up in many places. Some particular uses are to calculate the length of a vector and to show that two vectors are orthogonal (and hence linearly independent). The operation has the following properties:

Commutative: \vec v \cdot \vec w = \vec w \cdot \vec v
Distributive: (\vec v + \vec w) \cdot \vec x = \vec v \cdot \vec x + \vec w \cdot \vec x
Associative: (c\vec v) \cdot \vec w = c(\vec v \cdot \vec w)
Cauchy-Schwarz Inequality: \vec x \cdot \vec y \le |\vec x \cdot \vec y| \le \|\vec x\| \|\vec y\|
|\vec x \cdot \vec y| = \|\vec x\| \|\vec y\| \implies \vec x = c \vec y , where c is non-zero.
Vector Triangle Inequality: \|\vec x + \vec y\| \le \|\vec x\| + \|\vec y\|
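A quick numerical spot-check of the definition and the Cauchy-Schwarz inequality (just a sketch using numpy, which isn't otherwise used in these notes):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -5.0, 6.0])

print(np.dot(x, y))     # 12.0 = 1*4 + 2*(-5) + 3*6
# Cauchy-Schwarz: |x . y| <= ||x|| ||y||
print(abs(np.dot(x, y)) <= np.linalg.norm(x) * np.linalg.norm(y))   # True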

Dot product defines the angle between vectors

The dot product also defines the angle between vectors. Take two unit vectors \hat a and \hat b . We can calculate: \text{angle between}\ \hat a\ \text{and}\ \hat b = \theta = \theta_a - \theta_b We know that: \cos(\theta_a) = a_x \;\text{and}\; \sin(\theta_a) = a_y \\ \cos(\theta_b) = b_x \;\text{and}\; \sin(\theta_b) = b_y \\ And we know that: \cos(\theta) = \cos(\theta_a - \theta_b) = \cos(\theta_a)\cos(\theta_b) + \sin(\theta_a)\sin(\theta_b) Therefore, by substitution: \begin{align} \cos(\theta) &= a_xb_x + a_yb_y \\ &= a \cdot b \end{align} Thus, \cos(\theta) = a \cdot b describes the angle between unit vectors. If the vectors are not unit vectors then the angle is defined, in general, by: \cos(\theta) = \frac{1}{\|\vec a\|\|\vec b\|} \vec a \cdot \vec b
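The angle formula is easy to check numerically too. Again, just a sketch assuming numpy:

import numpy as np

a = np.array([3.0, 0.0])   # along the x-axis
b = np.array([1.0, 1.0])   # along the line y = x

# cos(theta) = (a . b) / (||a|| ||b||)
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta)))   # 45.0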

Unit Length Vectors

A unit vector is a vector that has a length of 1. If \hat u is a vector (notice the special symbol used) then we define its length as: \mathrm{length}(\hat u) = \|\hat u\| = \sqrt{u_1^2 + u_2^2 + \cdots + u_n^2} Thus if a vector is a unit vector then \|\hat u\| = 1 .

Interestingly, we can also write \|\vec u\|^2 = \vec u \cdot \vec u .

To construct a unit vector where \sqrt{u_1^2 + u_2^2 + \cdots + u_n^2} = 1 , we need the sum u_1^2 + u_2^2 + \cdots + u_n^2 to also be 1! How do we achieve this if \|\vec u\| \ne 1 ? We do the following: \hat{u} = \frac{1}{\|\vec u\|}\vec u Assuming, of course, that \|\vec u\| \ne 0 .
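In code, normalising a vector is just a divide by its length (a numpy sketch, with the usual caveat that the vector must be non-zero):

import numpy as np

u = np.array([3.0, 4.0])
u_hat = u / np.linalg.norm(u)   # scale by 1 / ||u||

print(u_hat)                    # [0.6 0.8]
print(np.linalg.norm(u_hat))    # 1.0 -- it is now a unit vector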

Orthogonal Vectors and Linear Independence

Image showing 3 perpendicular vectors
Figure 1

Orthogonal vectors are vectors that are perpendicular to each other. In a standard 2D or 3D graph, this means that they are at right angles to each other and we can visualise them as seen in Figure 1: \vec x is at 90 degrees to both \vec y and \vec z , \vec y is at 90 degrees to \vec x and \vec z , and \vec z is at 90 degrees to \vec x and \vec y .

In any set of vectors, [\vec{a_1}, ..., \vec{a_n}] , if every (non-zero) vector in the set is orthogonal to every other vector in the set then the vectors are linearly independent.

So, how do we tell if one vector is orthogonal to another? The answer is the dot product, which is defined as follows: x \cdot y = \sum_{i=1}^n {x_i y_i}

Two vectors are orthogonal when their dot product is zero: x \cdot y = 0 \implies x and y are orthogonal.

But why is this the case? Let's imagine two arbitrary vectors, each on the circumference of a unit circle (so we know that they have the same length and one is therefore a proper rotation of the other around the centre of the circle). This is shown in figure 2. From the figure, we know the following:

Picture showing a vector being rotated
Figure 2

\begin{align} x_1 &= r \cos{\theta_1}& y_1 &= r \sin{\theta_1} \\ x_2 &= r \cos({\theta_1 + \theta_n}) & y_2 &= r \sin({\theta_1 + \theta_n}) \end{align} The vector (x_2, y_2) is the vector (x_1, y_1) rotated by \theta_n degrees. We can use the following trig identities: \begin{align} \sin(a \pm b)& = \sin(a)\cos(b) \pm \cos(a)\sin(b) \\ \cos(a \pm b)& = \cos(a)\cos(b) \mp \sin(a)\sin(b) \end{align} Substituting these into the formulas above, we get the following. \begin{align} x_2 &= r\cos\theta_1\cos\theta_n - r\sin\theta_1\sin\theta_n \\ y_2 &= r\sin\theta_1\cos\theta_n + r\cos\theta_1\sin\theta_n \\ \end{align} Which means that... \begin{align} x_2 &= x_1\cos\theta_n - y_1\sin\theta_n \\ y_2 &= y_1\cos\theta_n + x_1\sin\theta_n \end{align} For a 90 degree rotation \theta_n = 90^\circ we know that \cos\theta_n = 0 and \sin\theta_n = 1 . Substituting these values into the above equations we can clearly see that... \begin{align} x_2 &= -y_1 \\ y_2 &= x_1 \end{align} Therefore, any vector in 2D space that is [a, b] will become [-b, a] and therefore the dot product becomes -ab + ab = 0 . And voilà, we now know why the dot product of two orthogonal 2D vectors is zero. I'm happy to take it on faith that this extrapolates into n dimensions :)

Another way of looking at this is to consider the law of cosines, C^2 = A^2 + B^2 - 2AB\cos(\theta_c) , where \theta_c is the angle opposite side C :

Law of cosines

Here A and B represent the magnitudes of the vectors \vec x and \vec y , i.e., \|\vec x\| and \|\vec y\| . C represents the magnitude of the vector \vec x - \vec y , i.e., \|\vec x - \vec y\| . Thus, we have this:

Law of cosines

I.e., \|\vec x - \vec y\|^2 = \|\vec x\|^2 + \|\vec y\|^2 - 2\|\vec x\|\|\vec y\|\cos(\theta_c) .

When the angle between the vectors is 90 degrees, \cos(90^\circ) = 0 and so that part of the equation disappears to give us \|\vec x - \vec y\|^2 = \|\vec x\|^2 + \|\vec y\|^2 . Now... \|\vec x - \vec y\|^2 = \|\vec x\|^2 + \|\vec y\|^2 \\ \therefore \|\vec x\|^2 + \|\vec y\|^2 - \|\vec x - \vec y\|^2 = 0 Recall that \|\vec v\| = \sqrt{v_1^2 + v_2^2 + ... + v_n^2} and that therefore \|\vec v\|^2 = v_1^2 + v_2^2 + ... + v_n^2 .

Based on this we can say that... \begin{align} \|\vec x - \vec y\|^2 &= (x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_n - y_n)^2 \\ &= x_1^2 - 2x_1y_1 + y_1^2 + x_2^2 - 2x_2y_2 + y_2^2 + ... + x_n^2 - 2x_ny_n + y_n^2 \end{align} Now, all of the x_i^2 and y_i^2 terms above will be cancelled out by the \|\vec x\|^2 and \|\vec y\|^2 terms, leaving us with... \|\vec x\|^2 + \|\vec y\|^2 - \|\vec x - \vec y\|^2 = 2x_1y_1 + 2x_2y_2 + ... + 2x_ny_n = 0 Which is just twice the dot product, so when the vectors are at 90 degrees the dot product must be zero (divide through by 2).
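Both arguments are easy to sanity-check numerically. A small numpy sketch: rotate a 2D vector by 90 degrees (the [a, b] to [-b, a] trick from above) and confirm the dot product vanishes, then confirm the law-of-cosines identity in 3D:

import numpy as np

v = np.array([2.0, 5.0])
v_rot = np.array([-v[1], v[0]])   # [a, b] -> [-b, a], i.e. a 90 degree rotation
print(np.dot(v, v_rot))           # 0.0 -- orthogonal

# ||x||^2 + ||y||^2 - ||x - y||^2 = 2 (x . y), so it is zero exactly when x . y = 0
x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, -5.0, 6.0])
lhs = np.linalg.norm(x)**2 + np.linalg.norm(y)**2 - np.linalg.norm(x - y)**2
print(np.isclose(lhs, 2 * np.dot(x, y)))   # True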

Matrices

Properties

Khan Academy has great articles. This is another good reference from MIT.

Scalar Multiplication

  1. Associative: (cd)A=c(dA)
  2. Distributive: c(A + B)=cA + cB and (c + d)A = cA + dA
  3. Identity: 1A = A
  4. Zero: 0A = O
  5. Closure: cA is a matrix of the same dimensions as A .

Note that, by convention, scalar multiplication is also commutative: cA = Ac (matrix multiplication, below, is not!).

Matrix Addition

  1. Commutative: A+B=B+A
  2. Associative: A+(B+C)=(A+B)+C
  3. Zero: For any matrix A , there is a unique matrix O such that A+O=A
  4. Inverse: \forall A,\ \exists! −A\ \mid\ A+(−A)=O
  5. Closure: A+B , is a matrix of the same dimensions as A and B .

Matrix Multiplication

  1. Associative: (AB)C=A(BC)
  2. Distributive: A(B+C)=AB+AC
  3. Identity: IA=A
  4. Zero: OA=O
  5. Inverse: Only if A is a square matrix, then A^{-1} is its inverse and AA^{-1} = A^{-1}A = I . Note that not all matrices are invertible.
  6. Dimension: If A is an m \times n matrix and B is an n \times k matrix then AB is an m \times k matrix.

Note that matrix multiplication is not commutative: \color{red}{AB \ne BA} !

Be careful: the equation AB = 0 does not necessarily mean that A = 0 or B = 0 and for a general matrix A , AB = AC \nRightarrow B = C , unless A is invertible.

Matrix Transpose

  1. Associative: (rA)^T = rA^T
  2. Distributive (into addition): (A + B)^T = A^T + B^T
  3. Distributive (into mult): (AB)^{T} = B^{T}A^{T}\ \color{red}{\blacktriangleleft \text{Note the reversal of terms}}
  4. Identity: I^T = I
  5. Inverse: (A^T)^T = A

Matrix Inverse

  1. Associative (self with inverse): AA^{−1} = A^{−1}A = I
  2. Distributive (into scalar mult): (rA)^{−1} = r^{−1}A^{−1},\ r \ne 0
  3. Distributive (into mult): (AB)^{-1} = B^{-1}A^{-1}\ \color{red}{\blacktriangleleft \text{Note the reversal of terms}}
  4. Identity: I^{-1} = I
  5. Name?: (A^{-1})^{-1} = A
  6. Name?: (A^T)^{-1} = (A^{-1})^T

Matrices And Simultaneous Linear Equations


The set of linear equations: a_{11}x_1 + a_{12}x_2 + ... + a_{1n}x_n = b_1\\ a_{21}x_1 + a_{22}x_2 + ... + a_{2n}x_n = b_2\\ ... \\ a_{m1}x_1 + a_{m2}x_2 + ... + a_{mn}x_n = b_m Can be represented in matrix form: \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \\ \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \\ \end{bmatrix} Any vector \vec x , where A\vec x = \vec b , is a solution to the set of simultaneous linear equations. How can we solve this from the matrix? We create an augmented or partitioned matrix: \left[ \begin{array}{cccc|c} a_{11} & a_{12} & \cdots & a_{1n} & b_1\\ a_{21} & a_{22} & \cdots & a_{2n} & b_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} & b_m\\ \end{array} \right] Next this matrix must be converted to row echelon form. This is where each row has the same number of, or fewer, leading zeros than the row below it, and any all-zero rows sit at the bottom. For example, the following matrix is in row echelon form: \begin{bmatrix} 1 & 2 & 3 & 4 \\ 0 & 1 & 2 & 3 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 2 \\ \end{bmatrix} But, this next one is not: \begin{bmatrix} 1 & 2 & 3 & 4 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 2 \\ 0 & 1 & 2 & 3 \\ \end{bmatrix} We go from a matrix to the equivalent in row-echelon form using elementary row operations:

  1. Row interchange: R_i \leftrightarrow R_j
  2. Row scaling: R_i \rightarrow kR_i
  3. Row addition: R_i \rightarrow R_i + kR_j

And woop, woop, even more important than echelon form is reduced row-echelon form (rref). This is when:

  1. If a row has all zeros then it is below all rows with a non-zero entry.
  2. The left-most non-zero entry in each row is 1 - it is called the pivot.
  3. The pivot is the only non-zero entry in its column.
  4. Each row has more zeros to the left of the pivot than the row above.

So why do we give a damn about rref if we have row-echelon form? Well, one reason is it lets us figure out if a set of linear equations has a solution and which variables are free and which are dependent.

A free variable is one that you have control over, i.e., you can choose and manipulate its value. A dependent variable's value is determined by the values of the free variables, or is a constant.

Pivot columns represent variables that are dependent. Why? Recall that the column containing a pivot is all zeros except for the pivot itself. So if element j in a row of a rref matrix is a pivot, then x_j only appears in that one equation of the rref system. Thus, its value is completely determined by that equation: it is either a constant or a combination of the free variables.

If we have an augmented matrix [A | b] :

  1. No Solution. If the last (augmented) column of [A|b] (ie. the column which contains b ) contains a pivot entry.
  2. Exactly One Solution. If the last (augmented) column of [A|b] (ie. the column which contains b ) does not contain a pivot entry and all of the columns of the coefficient part (ie. the columns of A ) do contain a pivot entry.
  3. Infinitely Many Solutions. If the last (augmented) column of [A|b] (ie. the column which contains b ) does not contain a pivot entry and at least one of the columns of the coefficient part (ie. the columns of A ) also does not contain a pivot entry.

We can find the reduced row echelon form of a matrix in python using the sympy package and the rref() function:

from sympy import Matrix, init_printing

init_printing(use_unicode=True)

# The augmented matrix [A | b] for the simultaneous equations solved further below
system = Matrix(( (3, 6, -2, 11), (1, 2, 1, 32), (1, -1, 1, 1) ))
system            # displays the matrix (when run in an interactive session)
system.rref()[0]  # rref() returns (rref matrix, pivot column indices); take the matrix

Which outputs:

⎡1  0  0  -17/3⎤
⎢              ⎥
⎢0  1  0  31/3 ⎥
⎢              ⎥
⎣0  0  1   17  ⎦

To then apply this to our augmented matrix we could do something like the following. Let's say we have a set of 3 simultaneous equations with 3 unknowns: 3x + 6y - 2z = 11 \\ x + 2y + z = 32 \\ x - y + z = 1 We'd make an augmented matrix like so: \left[ \begin{array}{ccc|c} 3 & 6 & -2 & 11\\ 1 & 2 & 1 & 32\\ 1 & -1 & 1 & 1 \end{array} \right] To solve this using Python we do:

from sympy import Matrix, solve_linear_system, init_printing
from sympy.abc import x, y, z

init_printing(use_unicode=True)

system = Matrix(( (3, 6, -2, 11), (1, 2, 1, 32), (1, -1, 1, 1) ))
system

solve_linear_system(system, x, y, z)  # returns a dict mapping each symbol to its solution

Which will give the following output:

⎡3  6   -2  11⎤
⎢             ⎥
⎢1  2   1   32⎥
⎢             ⎥
⎣1  -1  1   1 ⎦
{x: -17/3, y: 31/3, z: 17}

Matrix / Vector Multiplication

Matrix/vector multiplication is pretty straightforward: \begin{bmatrix} a & c \\ b & d \\ \end{bmatrix} \begin{bmatrix} 2 \\ 4 \end{bmatrix} = \begin{bmatrix} 2a + 4c \\ 2b + 4d \end{bmatrix}

But having said that, I'd never really thought of it as shown below until watching some stuff on Khan Academy:

\begin{bmatrix} \color{red}a & \color{blue}c \\ \color{red}b & \color{blue}d \\ \end{bmatrix} \begin{bmatrix} 2 \\ 4 \end{bmatrix} = 2 \begin{bmatrix} \color{red}a \\ \color{red}b \end{bmatrix} + 4 \begin{bmatrix} \color{blue}c \\ \color{blue}d \end{bmatrix}

We can think of it as a linear combination of the column vectors of the matrix, where the coefficients are the elements of the vector: the first column vector is multiplied by the first element, the second column vector by the second element, and so on.

So, for any m \times n matrix, A , and vector \vec v \in \mathbb{R}^n we can say that, A\vec v = \begin{bmatrix} \uparrow & \uparrow & & \uparrow\\ \vdots & \vdots & & \vdots\\ \vec{c_1} & \vec{c_2} & \cdots & \vec{c_n}\\ \vdots & \vdots & & \vdots\\ \downarrow & \downarrow & & \downarrow \end{bmatrix} \begin{bmatrix} {v_1} \\ {v_2} \\ \vdots \\ {v_n} \end{bmatrix} = v_1\vec{c_1} + v_2\vec{c_2} + \cdots + v_n\vec{c_n}
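Here's that column-combination view checked numerically (a numpy sketch with a made-up 3x2 matrix):

import numpy as np

A = np.array([[1.0, 4.0],
              [2.0, 5.0],
              [3.0, 6.0]])    # columns c1 = [1, 2, 3], c2 = [4, 5, 6]
v = np.array([2.0, 4.0])

print(A @ v)                            # [18. 24. 30.]
print(2.0 * A[:, 0] + 4.0 * A[:, 1])    # the same linear combination of the columns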

Null Space Of A Matrix

For an m \times n matrix A , the nullspace of the matrix A , N(A) , is defined as follows: N(A) = \{\vec x \in \mathbb{R}^n \mid A\vec x = \vec 0 \} The null space of a matrix is a valid subspace of \mathbb{R}^n .

It can also be shown that N(A) = N(\mathrm{RREF}(A)) , where \mathrm{RREF} is the reduced row echelon form.

The nullspace of a matrix can be related to the linear independence of its column vectors. Using the above definition of nullspace, we can write: A\vec x = \begin{bmatrix} \uparrow & \uparrow & & \uparrow\\ \vdots & \vdots & & \vdots\\ \vec{a_1} & \vec{a_2} & \cdots & \vec{a_n}\\ \vdots & \vdots & & \vdots\\ \downarrow & \downarrow & & \downarrow \end{bmatrix} \begin{bmatrix} {x_1} \\ {x_2} \\ \vdots \\ {x_n} \end{bmatrix} = \begin{bmatrix} 0_1 \\ 0_2 \\ \vdots \\ 0_m \end{bmatrix} Recalling how we can write a matrix/vector product as a linear combination of the matrix's columns, we get, x_1\vec{a_1} + x_2\vec{a_2} + \cdots + x_n\vec{a_n} = \vec 0 Recall from the section on linear independence that this set of vectors is linearly independent only when the sole solution to the above equation is x_1 = x_2 = \cdots = x_n = 0 . Given that the nullspace is, by definition, the set of all vectors \vec x that satisfy the above, we can say that the column vectors of a matrix A are linearly independent if and only if N(A) = \{\vec 0\} .
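sympy can compute a basis for the nullspace directly with nullspace() , which makes the link to column independence easy to play with (a sketch using made-up matrices):

from sympy import Matrix

A = Matrix([[1, 2, 3],
            [2, 4, 6]])     # the columns are clearly linearly dependent

# nullspace() returns a list of basis vectors for N(A).
print(A.nullspace())        # non-empty list => N(A) != {0} => dependent columns

B = Matrix([[1, 0],
            [0, 1]])
print(B.nullspace())        # [] => N(B) = {0} => independent columns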

Orthogonal Matrices

So, onto orthogonal matrices. A matrix is orthogonal if AA^T = A^TA = I . If we take a general matrix and multiply it by its transpose we get...

AA^T = \begin{pmatrix} a & c \\ b & d \\ \end{pmatrix} \begin{pmatrix} a & b \\ c & d \\ \end{pmatrix} = \begin{pmatrix} aa + cc & ab + cd \\ ab + cd & bb + dd \\ \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ \end{pmatrix}

The pattern ab + cd looks pretty familiar right?! It looks a lot like x_1x_2 + y_1y_2 , our formula for the dot product of two vectors. So, when we say a=x_1 , b=x_2 , c=y_1 and d=y_2 , we will get the following: a matrix of two row vectors \vec v_1 = [x_1, y_1] , \vec v_2 = [x_2, y_2] . A = \begin{pmatrix} \vec v_1 \\ \vec v_2 \\ \end{pmatrix} = \begin{pmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \end{pmatrix} \therefore AA^T = \begin{pmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \end{pmatrix} \begin{pmatrix} x_1 & x_2 \\ y_1 & y_2 \\ \end{pmatrix} = \begin{pmatrix} x_1^2 + y_1^2 & x_1x_2 + y_1y_2 \\ x_1x_2 + y_1y_2 & x_2^2 + y_2^2 \\ \end{pmatrix} If our vectors \vec v_1 and \vec v_2 are orthogonal, then the component x_1x_2 + y_1y_2 must be zero. This would give us the first part of the identity matrix pattern we're looking for.

The other part of the identity matrix would imply that we would have to have x_1^2 + y_1^2 = 1 and x_2^2 + y_2^2 = 1 ... which are just the formulas for the square of the length of a vector. Therefore, if our two vectors are also normalised (i.e., have a length of 1), we have our identity matrix.

Does the same hold true for A^TA ? It doesn't if we use our original matrix A !... A^TA = \begin{pmatrix} x_1 & x_2 \\ y_1 & y_2 \\ \end{pmatrix} \begin{pmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \end{pmatrix} = \begin{pmatrix} x_1^2 + x_2^2 & x_1y_1 + x_2y_2 \\ x_1y_1 + x_2y_2 & y_1^2 + y_2^2 \\ \end{pmatrix} Oops, we can see that we didn't get the identity matrix!! But, perhaps we can see why. If A was a matrix of row vectors then A^T is a matrix of column vectors. So for AA^T we were multiplying a matrix of row vectors with a matrix of column vectors, which would, in part, give us the dot products as we saw. So if we want A^TA to work in the same way, it follows that A now needs to be a matrix of column vectors ( A^T then becomes a matrix of row vectors), which gets us back to our original result: A^TA = \begin{pmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \end{pmatrix} \begin{pmatrix} x_1 & x_2 \\ y_1 & y_2 \\ \end{pmatrix} = \begin{pmatrix} x_1^2 + y_1^2 & x_1x_2 + y_1y_2 \\ x_1x_2 + y_1y_2 & x_2^2 + y_2^2 \\ \end{pmatrix}

So we can say that if we have a matrix whose rows are orthonormal vectors and whose columns are also orthonormal vectors, then we have an orthogonal matrix!

Okay, that's great n' all, but why should we care? Why are orthogonal matrices useful? It turns out that orthogonal matrices preserve angles and lengths of vectors. This can be useful in graphics to rotate vectors but keep the shape they construct, or in numerical analysis because they do not amplify errors.
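A rotation matrix is the classic example: a quick numpy sketch showing that it is orthogonal and that it preserves lengths.

import numpy as np

theta = np.radians(30)
R = np.array([[np.cos(theta), -np.sin(theta)],    # rows and columns are
              [np.sin(theta),  np.cos(theta)]])   # orthonormal vectors

print(np.allclose(R @ R.T, np.eye(2)))            # True
print(np.allclose(R.T @ R, np.eye(2)))            # True

v = np.array([3.0, 4.0])
print(np.linalg.norm(v), np.linalg.norm(R @ v))   # 5.0 5.0 -- length preserved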

Determinants

This is one of the better YouTube tutorials I've found that explains the concepts, and not just the mechanics, behind determinants, by 3Blue1Brown (awesome!).

Eigenvectors and Eigenvalues

Stackoverflow thread: What is the importance of eigenvalues and eigenvectors.

another

Eigenvectors and values exist in pairs: every eigenvector has a corresponding eigenvalue. An eigenvector is a direction, ... An eigenvalue is a number, telling you how much variance there is in the data in that direction ...
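A minimal numerical example of the defining relation A\vec v = \lambda \vec v , using numpy's eig (just a sketch; the matrix is made up):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)      # eigenvectors are the columns
print(eigenvalues)                                # e.g. [3. 1.] (order not guaranteed)

for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))            # True -- A v = lambda v for each pair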

Linear Transformations

Functions

Normally a function is a mapping between two sets: from a domain to a range, which is a subset of the function's co-domain. I.e., the co-domain is the set of all possible things the function could map onto and the range is the subset of the co-domain containing the things the function actually maps to.

co-domain vs range

The usual function notation looks like this: f(x) = x^2 . Note that f is the function, not f(x) , which is the value output by the function, f , when x is input. There are other notations we use to denote functions.

For example the function f(x) = x^2 can be described as a mapping from the set of all real numbers to the set of all real numbers. f: \mathbb{R} \rightarrow \mathbb{R} It can also be defined using this notation: f: x \mapsto x^2

A function is not, however, just any mapping between two sets. For a mapping to be a function the following must be true: f(x) = y_1 \land f(x) = y_2 \implies y_1 = y_2 . In other words one item in the domain cannot map to more than one item in the range, but two distinct items in the domain can map to the same item in the range.

a function vs a mapping

Vector Valued Functions

Functions that map to \mathbb{R} are called scalar valued functions, or specifically in this case, real valued functions.

If the co-domain of a function is in \mathbb{R}^m , where m > 1 , then we say that the function is vector valued. Functions whose domain is in \mathbb{R}^m are called transformations.

A vector valued function could look something like this, for example: f\left( \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \right) = \begin{bmatrix} 2\pi x_1 + \frac{5}{2}x_3\\ 4x_2 \end{bmatrix} It could look like anything... the point is that because f operates on a vector, we call it a transformation.

Linear Transformations

Vector valued functions are transformations. Linear transformations are a particular type of transformation that obeys the following rules. If T: \mathbb{R}^n \mapsto \mathbb{R}^m and \vec a, \vec b \in \mathbb{R}^n then:

  1. T(\vec a + \vec b) = T(\vec a) + T(\vec b)
  2. T(\alpha \vec a) = \alpha T(\vec a) \mid \alpha \in \mathbb{R}
  3. T(\vec 0) = \vec 0

In other words, the origin must remain fixed and all lines must remain lines.

Matrices are used to do the linear transformations, so the function T will generally be defined as T(\vec v) = A\vec v , where A is the transformation matrix. 2 \times 2 matrices transform 2D space, 3 \times 3 matrices transform 3D space, and so on...

Interestingly the column vectors of the transformation matrix A tell us what happens to the canonical basis vectors in that space, and from this we can work out how any vector would be transformed.

Graph showing the linear transformation of the canonical vectors for R^2

We have seen that in cartesian space, \mathbb{R}^2 , for any coordinate, (x, y) , the numbers x and y are the coefficients of the canonical basis vectors... \begin{bmatrix} x \\ y \end{bmatrix} = x \begin{bmatrix} 1 \\ 0 \end{bmatrix} + y \begin{bmatrix} 0 \\ 1 \end{bmatrix} Let's say our transformation matrix has the following properties: T : \begin{bmatrix} 1 \\ 0 \end{bmatrix} \mapsto \begin{bmatrix} a \\ c \end{bmatrix} \\ T : \begin{bmatrix} 0 \\ 1 \end{bmatrix} \mapsto \begin{bmatrix} b \\ d \end{bmatrix} Well, we've seen we can re-write any vector as a combination of the above basis vectors, so we can write: \begin{align} T\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) &\mapsto T\left(x \begin{bmatrix} 1 \\ 0 \end{bmatrix} + y \begin{bmatrix} 0 \\ 1 \end{bmatrix}\right)\\ & \mapsto xT\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) + yT\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) \\ & \mapsto \begin{bmatrix} \color{red}ax + \color{blue}by \\ \color{red}cx + \color{blue}dy \end{bmatrix} \end{align} Which looks suspiciously like a matrix-vector product. If we describe the transformation, T as follows: T = \begin{bmatrix} \color{red}a & \color{blue}b \\ \color{red}c & \color{blue}d \end{bmatrix} Then... T\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \begin{bmatrix} \color{red}a & \color{blue}b \\ \color{red}c & \color{blue}d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \color{red}ax + \color{blue}by \\ \color{red}cx + \color{blue}dy \end{bmatrix} And so we can see that the first column vector of T shows us where the first canonical basis vector [1, 0] maps to and the second column vector of T shows us where the second canonical basis vector [0, 1] maps to... woop! In other words, just to be clear... T = \begin{bmatrix} \color{red}a & \color{blue}b \\ \color{red}c & \color{blue}d \end{bmatrix} = \begin{bmatrix} T\left( \begin{bmatrix} 1 \\ 0 \end{bmatrix} \right) & T\left( \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right) \end{bmatrix} We can also note that this is the same as taking the transform of each of the column vectors of the identity matrix!
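The "columns of T are the transformed basis vectors" idea is easy to see in code (a numpy sketch; the matrix is made up):

import numpy as np

T = np.array([[1.0, 2.0],    # first column: where [1, 0] lands
              [3.0, 4.0]])   # second column: where [0, 1] lands

print(T @ np.array([1.0, 0.0]))   # [1. 3.] -- the first column of T
print(T @ np.array([0.0, 1.0]))   # [2. 4.] -- the second column of T
print(T @ np.eye(2))              # transforming the identity's columns just gives T back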

We can see this more "solidly" in the sketch below...

Graph showing the linear transformation of the canonical vectors for R^2

We can see how one coordinate can be represented using the canonical Cartesian coordinates, but also coordinates in our transformed space. Thus the same "point" can be represented using different coordinates (coefficients of the basis vectors) for different transform spaces.

Another important point is that it can be shown that all matrix-vector products are linear transformations!

Linear transformations also have these important properties, if we take two transformations S: \mathbb{R}^n \mapsto \mathbb{R}^m and T: \mathbb{R}^n \mapsto \mathbb{R}^m , where S has the transformation matrix A and T has the transformation matrix B :

  • Transformation addition: (S+T)(\vec x) = S(\vec x) + T(\vec x) = (A+B)\vec x .
  • Scalar multiplication: (cS)(\vec x) = cS(\vec x) = (cA)\vec x .

Graph showing a linear transformation doing rotations

Projections

Vector projections

The graph to the left shows a vector \vec x and its projection onto the line L , defined by the vector \vec v . The projection of \vec x onto L describes how much of \vec x goes in the direction of L .

We write "the projection of \vec x onto L like so: \textrm{Proj}_L(\vec x) The projection is found by dropping a line, perpendicular to L from the top of \vec x . Where this line hits L is the end of the vector that describes the projection. This is shown on the graph by the smaller green vector labelled \vec x - \textrm{Proj}_L .

Because \vec x - \textrm{Proj}_L(\vec x) is perpendicular to L we can say the following (recall that the dot product of perpendicular vectors is zero): (\vec x - \textrm{Proj}_L(\vec x)) \cdot \vec v = 0,\ \text{where}\ \vec v\ \text{is a defining vector of the line L} \\ \therefore (\vec x - c\vec v) \cdot \vec v = 0, \text{because}\ c\vec v \in L\ \text{and}\ \textrm{Proj}_L(\vec x) \in L\ \text{it means, for some}\ c,\ c\vec v = \textrm{Proj}_L(\vec x) \\ \therefore c = \frac{\vec v \cdot \vec x}{\|\vec v\|^2} Thus the projection is defined as: \textrm{Proj}_L(\vec x) = \frac{\vec v \cdot \vec x}{\|\vec v\|^2} \vec v,\ \text{such that}\ \vec v \in L And if \vec v is a unit vector \hat v , then this simplifies to: \textrm{Proj}_L(\vec x) = (\hat v \cdot \vec x) \hat v And we have seen in a previous section how to make a vector a unit vector. Whoopdi do!
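The projection formula in code (a numpy sketch; proj is a hypothetical helper, not a library function):

import numpy as np

def proj(x, v):
    """Projection of x onto the line through the origin defined by v."""
    return (np.dot(v, x) / np.dot(v, v)) * v    # (v . x / ||v||^2) v

x = np.array([2.0, 3.0])
v = np.array([1.0, 0.0])    # L is the x-axis

p = proj(x, v)
print(p)                    # [2. 0.]
print(np.dot(x - p, v))     # 0.0 -- the residual x - proj is perpendicular to L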

Turns out that projection is a linear transform! To calculate the matrix that describes the projection transform, apply the projection to each of the column vectors of the identity matrix, where the identity matrix has the same dimension as the vectors being projected. So, for example, for vectors in \mathbb{R}^n , to get the transformation matrix A for a projection onto a line defined by the unit vector \hat u : A = \begin{bmatrix} \textrm{Proj}_L(I_{\text{col 1}}) & \textrm{Proj}_L(I_{\text{col 2}}) & \cdots & \textrm{Proj}_L(I_{\text{col n}}) \end{bmatrix}
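And the transformation matrix built exactly as described above, by projecting each column of the identity matrix (another numpy sketch):

import numpy as np

v = np.array([2.0, 1.0])
v_hat = v / np.linalg.norm(v)          # unit vector defining the line L

# Column j of A is Proj_L applied to column j of the identity matrix.
I = np.eye(2)
A = np.column_stack([(v_hat @ I[:, j]) * v_hat for j in range(2)])
print(A)                               # the projection matrix (= outer(v_hat, v_hat))

x = np.array([3.0, 4.0])
print(A @ x)                           # [4. 2.]
print((v_hat @ x) * v_hat)             # the same, via the formula directly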

PCA

Really liked these YouTube tutorials by Data4Bio:

  1. Dimensionality Reduction: High Dimensional Data, Part 1
  2. Dimensionality Reduction: High Dimensional Data, Part 2
  3. Dimensionality Reduction: High Dimensional Data, Part 3
  4. Dimensionality Reduction: Principal Components Analysis, Part 1
  5. Dimensionality Reduction: Principal Components Analysis, Part 2
  6. Dimensionality Reduction: Principal Components Analysis, Part 3