Linear Algebra refresher

In this lecture we will go through some of the key concepts of linear algebra and inverse problem theory that are required to develop the theory behind the different machine learning algorithms presented in this course. This is not meant to be an exhaustive treatise, and students are strongly advised to take ErSE 213 - Inverse Problems prior to this course.

Three key mathematical objects arise in the study of linear algebra:

Scalars: $a \in \mathbb{R}$, a single number represented by a lower case italic letter;

Vectors: $\mathbf{x} = [x_1, x_2, ..., x_N]^T \in \mathbb{R}^N$, an ordered collection of $N$ numbers represented by a lower case bold letter. It is sometimes useful to extract a subset of elements by defining a set $\mathbb{S}$ and adding it as a subscript, $\mathbf{x}_\mathbb{S}$. As an example, given $\mathbf{x} = [x_1, x_2, x_3, x_4, x_5, x_6]^T \in \mathbb{R}^6$ and $\mathbb{S} = \{1, 3, 5\}$, we can define the vector $\mathbf{x}_\mathbb{S} = [x_1, x_3, x_5]$ and its complementary vector $\mathbf{x}_{-\mathbb{S}} = [x_2, x_4, x_6]$.
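This subset notation can be reproduced with NumPy indexing. The following is a minimal sketch; the array values and variable names are placeholders chosen only for illustration:

```python
import numpy as np

# Hypothetical 6-element vector mirroring the example above
x = np.array([10., 20., 30., 40., 50., 60.])

# Index set S = {1, 3, 5} in the 1-based notation of the text
# (NumPy is 0-based, so we shift by one)
S = np.array([1, 3, 5]) - 1

x_S = x[S]                # subset vector [x1, x3, x5] -> [10., 30., 50.]
x_notS = np.delete(x, S)  # complementary vector [x2, x4, x6] -> [20., 40., 60.]
```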

Matrices: $\mathbf{X} \in \mathbb{R}^{[N \times M]}$, a two-dimensional collection of numbers represented by an upper case bold letter, where $N$ and $M$ are referred to as the height and width of the matrix. More specifically, a matrix can be written as

$$\mathbf{X} = \begin{bmatrix} x_{1,1} & x_{1,2} & ... & x_{1,M} \\ ... & ... & ... & ... \\ x_{N,1} & x_{N,2} & ... & x_{N,M} \end{bmatrix}$$

A matrix can be indexed by rows, $\mathbf{X}_{i, :}$ (i-th row), by columns, $\mathbf{X}_{:, j}$ (j-th column), and by element, $X_{i, j}$ (i-th row, j-th column). A number of useful operations that are commonly applied to vectors and matrices are now described (two short NumPy sketches after the list illustrate them in code):

  • Transpose: $\mathbf{Y} = \mathbf{X}^T$, where $Y_{i, j} = X_{j, i}$
  • Matrix plus vector: $\mathbf{Y}_{[N \times M]} = \mathbf{X}_{[N \times M]} + \mathbf{z}_{[1 \times M]}$, where $Y_{i, j} = X_{i, j} + z_{j}$ ($\mathbf{z}$ is added to each row of the matrix $\mathbf{X}$)
  • Matrix-vector product: $\mathbf{y}_{[N \times 1]} = \mathbf{A}_{[N \times M]} \mathbf{x}_{[M \times 1]}$, where $y_i = \sum_{j=1}^M A_{i, j} x_j$
  • Matrix-matrix product: $\mathbf{C}_{[N \times K]} = \mathbf{A}_{[N \times M]} \mathbf{B}_{[M \times K]}$, where $C_{i,k} = \sum_{j=1}^M A_{i, j} B_{j, k}$
  • Hadamard product (i.e., element-wise product): $\mathbf{C}_{[N \times M]} = \mathbf{A}_{[N \times M]} \odot \mathbf{B}_{[N \times M]}$, where $C_{i,j} = A_{i, j} B_{i, j}$
  • Dot product: $a = \mathbf{x}_{[N \times 1]}^T \mathbf{y}_{[N \times 1]} = \sum_{i=1}^N x_i y_i$
  • Identity matrix: $\mathbf{I}_N = diag\{\mathbf{1}_N\}$. Based on its definition, we have that $\mathbf{I}_N \mathbf{x} = \mathbf{x}$ and $\mathbf{I}_N \mathbf{X} = \mathbf{X}$
  • Inverse matrix: given $\mathbf{y} = \mathbf{A} \mathbf{x}$, the inverse of $\mathbf{A}$ is the matrix that satisfies the equality $\mathbf{A}^{-1} \mathbf{A} = \mathbf{I}_N$, so that we can write $\mathbf{x} = \mathbf{A}^{-1} \mathbf{y}$
  • Orthogonal vectors and matrices: two vectors $\mathbf{x}$ and $\mathbf{y}$ are said to be orthogonal if $\mathbf{y}^T \mathbf{x} = 0$. Similarly, a matrix $\mathbf{X}$ is said to be orthogonal if $\mathbf{X}^T \mathbf{X} = \mathbf{I}_N$. Orthogonal matrices are especially interesting because their inverse is simply their transpose, $\mathbf{X}^{-1} = \mathbf{X}^T$
  • Matrix decomposition: just like an integer can be decomposed into a product of prime numbers, a matrix $\mathbf{A}$ can be decomposed into a combination of vectors (i.e., eigenvectors) and scalars (i.e., eigenvalues).
    - Eigendecomposition: real-valued, square, symmetric matrices can be written as $\mathbf{A} = \mathbf{V} \Lambda \mathbf{V}^T = \sum_i \lambda_i \mathbf{v}_i \mathbf{v}_i^T$, where $\lambda_i$ and $\mathbf{v}_i$ are the eigenvalues and eigenvectors of the matrix $\mathbf{A}$, respectively. Eigenvectors are placed along the columns of the matrix $\mathbf{V}$, which is an orthogonal matrix (i.e., $\mathbf{V}^T=\mathbf{V}^{-1}$). Eigenvalues are placed along the diagonal of the matrix $\Lambda=diag\{\lambda\}$ and tell us about the rank of the matrix, $rank(\mathbf{A}) = \# \lambda \neq 0$. A full-rank matrix is a matrix whose eigenvalues are all non-zero; such a matrix can be inverted, and its inverse is $\mathbf{A}^{-1}=\mathbf{V}\Lambda^{-1}\mathbf{V}^T$.
    - Singular value decomposition (SVD): this is a more general decomposition, which can be applied to real-valued, non-square, non-symmetric matrices. Singular vectors $\mathbf{u}$ and $\mathbf{v}$ and singular values $\lambda$ generalize the concept of eigenvectors and eigenvalues. The matrix $\mathbf{A}$ can be decomposed as $\mathbf{A} = \mathbf{U} \mathbf{D} \mathbf{V}^T$, where $\mathbf{D} = \Lambda$ for square matrices, $\mathbf{D} = [\Lambda \; \mathbf{0}]^T$ for $N>M$, and $\mathbf{D} = [\Lambda \; \mathbf{0}]$ for $M>N$. Similar to the eigendecomposition, the inverse of $\mathbf{A}$ is $\mathbf{A}^{-1}=\mathbf{V}\mathbf{D}^{-1}\mathbf{U}^T$ (a NumPy sketch of both decompositions follows this list).
  • Conditioning: in general, conditioning refers to how fast a function $f(x)$ changes given a small change in its input $x$. Similarly, for a matrix, conditioning is linked to the curvature of its associated quadratic form $f(\mathbf{A}) = \mathbf{x}^T \mathbf{A} \mathbf{x}$ and generally indicates how rapidly this function changes as a function of $\mathbf{x}$. It is defined as $cond(\mathbf{A})=\frac{|\lambda_{max}|}{|\lambda_{min}|}$.
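Most of these operations map one-to-one onto NumPy primitives. Below is a minimal sketch of the basic operations listed above; the shapes, random seed, and variable names are arbitrary choices made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 4, 3, 2
A = rng.standard_normal((N, M))
B = rng.standard_normal((M, K))
X = rng.standard_normal((N, M))
x = rng.standard_normal(M)
z = rng.standard_normal(M)

Y = X.T         # transpose: Y[i, j] = X[j, i]
Xz = X + z      # matrix plus vector: z is broadcast and added to each row of X
y = A @ x       # matrix-vector product
C = A @ B       # matrix-matrix product
H = A * X       # Hadamard (element-wise) product
d = x @ x       # dot product x^T x (a scalar)
I_N = np.eye(N)  # identity matrix

# Inverse of a (full-rank) square matrix: A^{-1} A = I
S = rng.standard_normal((N, N))
print(np.allclose(np.linalg.inv(S) @ S, I_N))   # True (up to round-off)

# An orthogonal matrix (here from a QR factorization) has Q^T as its inverse
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
print(np.allclose(Q.T @ Q, I_N))                # True
```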
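The eigendecomposition, the SVD, and the condition number can likewise be computed with np.linalg. This is again only a sketch with random placeholder matrices, used to verify the relations stated above:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 5, 3
B = rng.standard_normal((N, N))
A = B @ B.T                                   # real, square, symmetric matrix

# Eigendecomposition: A = V diag(lambda) V^T, with V orthogonal
lam, V = np.linalg.eigh(A)
print(np.allclose(V @ np.diag(lam) @ V.T, A))          # True

# Inverse from the eigendecomposition: A^{-1} = V diag(1/lambda) V^T
A_inv = V @ np.diag(1.0 / lam) @ V.T
print(np.allclose(A_inv, np.linalg.inv(A)))            # True

# SVD of a non-square matrix: G = U D V^T
G = rng.standard_normal((N, M))
U, s, Vt = np.linalg.svd(G, full_matrices=False)
print(np.allclose(U @ np.diag(s) @ Vt, G))             # True

# Condition number: ratio of largest to smallest absolute eigenvalue
print(np.abs(lam).max() / np.abs(lam).min())
print(np.linalg.cond(A))                               # same value for a symmetric A
```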

Norms: another important object that we will be using when defining cost functions for ML models is the norm. A norm is a function that maps a vector $\mathbf{x} \in \mathbb{R}^N$ to a scalar $d \in \mathbb{R}$, and it can be loosely seen as a measure of the length of the vector (i.e., its distance from the origin). In general, the $L^p$ norm is defined as:

$$||\mathbf{x}||_p = \left( \sum_i |x_i|^p \right)^{1/p} \quad p \ge 0$$
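This definition translates directly into code. The sketch below (the helper name lp_norm is purely illustrative) evaluates the formula for a few values of $p$ and checks it against np.linalg.norm:

```python
import numpy as np

def lp_norm(x, p):
    """L^p norm computed directly from the definition (valid for p > 0)."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0, 0.0, 1.0])
for p in (1, 2, 3):
    print(p, lp_norm(x, p), np.linalg.norm(x, ord=p))  # the two values match
```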

Popular norms are:

  • Euclidean norm ($L_2$): $||\mathbf{x}||_2 = \sqrt{\sum_i x_i^2}$, the distance of a vector from the origin of the N-dimensional Euclidean space. Note that $||\mathbf{x}||_2^2 = \mathbf{x}^T \mathbf{x}$ and that $||\mathbf{x}||_2=1$ for a unit vector;
  • $L_1$ norm: $||\mathbf{x}||_1 = \sum_i |x_i|$
  • $L_0$ norm: the number of non-zero elements in the vector $\mathbf{x}$
  • $L_\infty$ norm: $||\mathbf{x}||_\infty = max |x_i|$
  • Frobenius norm (for matrices): $||\mathbf{A}||_F = \sqrt{\sum_{i,j} A_{i,j}^2}$
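All of the norms listed above are readily available in NumPy. A minimal sketch, with arbitrary example values:

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

print(np.linalg.norm(x, 2))       # Euclidean (L2) norm: 5.0
print(np.linalg.norm(x, 1))       # L1 norm: 7.0
print(np.count_nonzero(x))        # L0 "norm": number of non-zero elements, 2
print(np.linalg.norm(x, np.inf))  # L_inf norm: max |x_i| = 4.0
print(np.linalg.norm(A, 'fro'))   # Frobenius norm of a matrix
```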