Packet 14

Application I: Least squares and linear regression

Least squares

Recall that one application of Gram-Schmidt is to compute the distance from a vector 𝐯 to a subspace U that is given as the span U = ⟨𝐮_1, …, 𝐮_k⟩ ⊂ ℝ^n. First we compute an orthogonal basis U = ⟨𝐰_1, …, 𝐰_ℓ⟩. Then define the projection:

projU(๐ฏ)=proj๐ฐ1(๐ฏ)+โ‹ฏ+proj๐ฐโ„“(๐ฏ).

Finally, define ๐ฏโŸ‚=๐ฏโˆ’projU(๐ฏ), and the distance is given by |๐ฏโŸ‚|.

The closest point in U to the vector ๐ฏ is the projection projU(๐ฏ). So the distance from ๐ฏ to U means the distance from ๐ฏ to the closest point in U, and this is the same as the magnitude of the displacement (difference) vector ๐ฏโˆ’projU(๐ฏ).
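For readers who want to experiment, the projection-and-distance recipe above can be sketched in a few lines of Python with NumPy (the function names here are our own, not part of the packet):

```python
import numpy as np

def proj(w, v):
    # projection of v onto the line spanned by w
    return (v @ w) / (w @ w) * w

def distance_to_span(v, spanning):
    # Gram-Schmidt: build an orthogonal basis w1, ..., wl of the span
    ws = []
    for u in spanning:
        w = u.astype(float)
        for prev in ws:
            w = w - proj(prev, u)
        if np.linalg.norm(w) > 1e-12:   # skip dependent vectors
            ws.append(w)
    # proj_U(v) = proj_{w1}(v) + ... + proj_{wl}(v)
    p = sum(proj(w, v) for w in ws)
    v_perp = v - p                      # v_perp = v - proj_U(v)
    return np.linalg.norm(v_perp), p

# distance from (1,2,3) to the plane spanned by (1,0,0) and (1,1,0)
d, p = distance_to_span(np.array([1.0, 2.0, 3.0]),
                        [np.array([1, 0, 0]), np.array([1, 1, 0])])
```

Here the span is the xy-plane, so the projection is (1, 2, 0) and the distance is 3, in agreement with the formula |𝐯⊥| above.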

The least squares problem (or โ€˜LLSโ€™ for โ€˜Linear Least Squaresโ€™) is the problem of finding the value of ๐ฑโˆˆโ„n that minimizes the quantity |๐›โˆ’A๐ฑ|, where ๐› is a given vector, and A a given matrix.

Sometimes the least-squares solution is called the best solution to an inconsistent system. It may happen that A𝐱 = 𝐛 has no solutions 𝐱; this occurs exactly when 𝐛 lies outside the column space of A, which is common when A has small rank relative to the dimension of 𝐛. When there are no solutions 𝐱, the system of equations that A𝐱 = 𝐛 represents is called an inconsistent system. The least-squares solution is the best approximate solution, in the sense that A𝐱 is made as close as possible to 𝐛.


โ€˜Least squaresโ€™

The terminology 'least squares' is used because the minimum of |𝐛−A𝐱| occurs at the same 𝐱 as the minimum of |𝐛−A𝐱|^2, and the latter can be written as a sum of squares. For example, if 𝐛 = (b_1, b_2)^𝖳 and A = (a_{ij}) is a 2×2 matrix, then:

|𝐛 − A𝐱|^2 = (b_1 − a_{11}x_1 − a_{12}x_2)^2 + (b_2 − a_{21}x_1 − a_{22}x_2)^2.

When varying ๐ฑ over all possible input vectors, the output A๐ฑ varies over the image (aka โ€˜column spaceโ€™) of A. Therefore, the least squares problem is the same as the problem of finding the projection of ๐› to the column space of A, and then finding an ๐ฑ which is sent by A to this projection.

The column space of A is the span of columns of A. Therefore, one valid method is to perform Gram-Schmidt to compute the projection projIm(A)(๐›), and then to solve the matrix equation A๐ฑ=projIm(A)(๐›).

Another valid method, usually faster, makes use of some calculations in matrix algebra:

Vectors 𝐱 that minimize |𝐛−A𝐱| are exactly the vectors 𝐱 that solve A𝖳A𝐱 = A𝖳𝐛 (the 'normal equations')

First we consider an (equivalent) rearrangement of the proposed equation:

A𝖳A𝐱 = A𝖳𝐛
A𝖳𝐛 − A𝖳A𝐱 = 𝟎
A𝖳(𝐛 − A𝐱) = 𝟎.

Recall that A๐–ณ๐ฐ=๐ŸŽ occurs if and only if ๐ฐ is perpendicular to the columns (and hence to the image) of A.

So the proposed equation holds for ๐ฑ if and only if ๐›โˆ’A๐ฑ is perpendicular to the image of A.

On the other hand, the projection projIm(A)(๐›) is the unique vector within Im(A) such that ๐›โŸ‚=๐›โˆ’projIm(A)(๐›) is perpendicular to Im(A). Therefore:

A๐–ณA๐ฑ=A๐–ณ๐›โŸบA๐ฑ=projIm(A)(๐›)โŸบ๐ฑย minimizesย |๐›โˆ’A๐ฑ|

The least-squares error is the minimum value of |๐›โˆ’A๐ฑ|. To find this error, first find the minimizing ๐ฑ and then compute this quantity directly.
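As an illustration, the normal-equations method can be coded in a few lines of Python/NumPy (a sketch of our own, assuming A has independent columns so that A𝖳A is invertible):

```python
import numpy as np

def lls_solve(A, b):
    # solve the normal equations A^T A x = A^T b
    x = np.linalg.solve(A.T @ A, A.T @ b)
    error = np.linalg.norm(b - A @ x)    # least-squares error |b - Ax|
    return x, error

A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
b = np.array([1.0, 2.0, 3.0])
x, error = lls_solve(A, b)   # x = (1, 2), error = 3
```

In this toy example the image of A is the xy-plane, so the best A𝐱 is (1, 2, 0) and the leftover error is the third coordinate of 𝐛.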

Compare: Projection Method vs. A๐–ณA๐ฑ=A๐–ณ๐›

We have two methods of finding ๐ฑ vectors that minimize |๐›โˆ’A๐ฑ|.

First method: Perform Gram-Schmidt on (๐š1โ‹ฏ๐šk) and then compute the projection of ๐› to the span โŸจ๐š1,โ€ฆ,๐škโŸฉ using the orthogonal basis you found with Gram-Schmidt. This vector can be used immediately to find the least-squares error, or plugged into the second equation A๐ฑ=projIm(A)(๐›) to solve for ๐ฑ.

Second method: Compute the matrix A๐–ณA and vector A๐–ณ๐› first, and then solve the matrix equation A๐–ณA๐ฑ=A๐–ณ๐› for possible ๐ฑ. All solutions ๐ฑ minimize the least-squares error, and can be plugged into |๐›โˆ’A๐ฑ| to find that error.

Question: Which method is better?

Answer: If A has orthogonal columns, then the projections method is better. Otherwise the equation A๐–ณA๐ฑ=A๐–ณ๐› is better.

Reason: When A has orthogonal columns, Gram-Schmidt can be skipped, and the projection calculated directly. Furthermore, the scalar projections are the components on columns of A โ€“ and these are exactly the components of ๐ฑ. The projection method is much less efficient except in this simple case of orthogonal columns.

Example

Computing least-squares solution: non-orthogonal case

Problem: Consider the matrix equation

\begin{pmatrix} 4 & 0 \\ 0 & 2 \\ 1 & 1 \end{pmatrix} \mathbf{x} = \begin{pmatrix} 2 \\ 0 \\ 6 \end{pmatrix}.

There is no exact solution; find the best approximate "least-squares" solution and the resulting error.

Solution: Write A and 𝐛 for the matrix and vector. Next calculate

A^{\mathsf{T}}A = \begin{pmatrix} 17 & 1 \\ 1 & 5 \end{pmatrix}, \qquad A^{\mathsf{T}}\mathbf{b} = \begin{pmatrix} 14 \\ 6 \end{pmatrix}.

Now solve the matrix equation

\begin{pmatrix} 17 & 1 \\ 1 & 5 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 14 \\ 6 \end{pmatrix}.

We use row-reduction on the augmented matrix:

\left(\begin{array}{cc|c} 17 & 1 & 14 \\ 1 & 5 & 6 \end{array}\right) \sim \left(\begin{array}{cc|c} 17 & 1 & 14 \\ 0 & 84/17 & 88/17 \end{array}\right) \sim \left(\begin{array}{cc|c} 17 & 1 & 14 \\ 0 & 84 & 88 \end{array}\right).

Therefore x2=88/84=22/21 and x1=(14โˆ’x2)/17=16/21.

To compute the error, first find A๐ฑ:

A\mathbf{x} = \begin{pmatrix} 4 & 0 \\ 0 & 2 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 16/21 \\ 22/21 \end{pmatrix} = \begin{pmatrix} 64/21 \\ 44/21 \\ 38/21 \end{pmatrix}

and then subtract from ๐› and compute the norm:

|\mathbf{b} - A\mathbf{x}| = \left|\begin{pmatrix} 2 \\ 0 \\ 6 \end{pmatrix} - \begin{pmatrix} 64/21 \\ 44/21 \\ 38/21 \end{pmatrix}\right| = \left|\begin{pmatrix} -22/21 \\ -44/21 \\ 88/21 \end{pmatrix}\right| = \frac{1}{21}\sqrt{22^2 + 44^2 + 88^2} = \frac{\sqrt{10164}}{21} \approx 4.80.
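These hand computations can be checked numerically; here is a quick Python/NumPy check of the same example:

```python
import numpy as np

A = np.array([[4.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([2.0, 0.0, 6.0])
x = np.linalg.solve(A.T @ A, A.T @ b)    # solve the normal equations
error = np.linalg.norm(b - A @ x)
# x should be (16/21, 22/21) and the error about 4.80
```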

Many solutions possible

If the null space of A is not just the zero vector, then elements of the null space may be added to any best approximate solution ๐ฑ without changing the output A๐ฑ. So the linear least-squares problem may have many solutions. All these solutions will be discovered as solutions of the equation A๐–ณA๐ฑ=A๐–ณ๐›.
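A tiny numerical illustration of this, with made-up numbers: the matrix below has dependent columns, so its null space contains (1, −1), and shifting any least-squares solution by a null-space vector leaves A𝐱 unchanged.

```python
import numpy as np

A = np.array([[1.0, 1.0], [2.0, 2.0]])   # dependent columns; null space spans (1, -1)
b = np.array([1.0, 3.0])
x_min = np.linalg.lstsq(A, b, rcond=None)[0]   # one least-squares solution
x_other = x_min + 5 * np.array([1.0, -1.0])    # shift by a null-space vector
# both give the same A @ x, and both satisfy the normal equations
```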

Example

Computing least-squares solution: orthogonal case

Problem: Find the least-squares solution for the following matrix equation:

A\mathbf{z} = \mathbf{b}: \qquad \begin{pmatrix} 1 & -3 \\ 1 & 1 \\ 1 & 0 \\ 1 & 4 \\ 1 & -2 \end{pmatrix} \mathbf{z} = \begin{pmatrix} 4 \\ -1 \\ -1 \\ -3 \\ 1 \end{pmatrix}.

Notice that the second column of the matrix and the vector represent the x and y values, respectively, of the data from Exercise 07-02.

Solution: The matrix columns are orthogonal. In this situation it is faster to compute the projection directly:

\mathrm{proj}_{\langle \mathbf{a}_1, \mathbf{a}_2 \rangle}(\mathbf{b}) = \frac{\mathbf{b}\cdot\mathbf{a}_1}{\mathbf{a}_1\cdot\mathbf{a}_1}\,\mathbf{a}_1 + \frac{\mathbf{b}\cdot\mathbf{a}_2}{\mathbf{a}_2\cdot\mathbf{a}_2}\,\mathbf{a}_2 = \frac{0}{5}\,\mathbf{a}_1 + \frac{-27}{30}\,\mathbf{a}_2 = \begin{pmatrix} 27/10 \\ -9/10 \\ 0 \\ -36/10 \\ 18/10 \end{pmatrix}.

Furthermore, the coefficients 0/5 = 0 and −27/30 = −9/10 are exactly the weights on the columns of A that yield the projection, so they give the best approximate solution. Therefore 𝐳 = (0, −9/10)^𝖳 is our solution.

Finally, the error is given by:

\left|\begin{pmatrix} 4 \\ -1 \\ -1 \\ -3 \\ 1 \end{pmatrix} - \begin{pmatrix} 27/10 \\ -9/10 \\ 0 \\ -36/10 \\ 18/10 \end{pmatrix}\right| = \left|\begin{pmatrix} 13/10 \\ -1/10 \\ -1 \\ 6/10 \\ -8/10 \end{pmatrix}\right| = \frac{\sqrt{370}}{10} \approx 1.92.
Question 14-01

Missing step?

In the above example, where did we solve the matrix equation A๐ฑ=projIm(A)(๐›) to find ๐ฑ? Explain.

Linear regression for summation curves

Let us say that a โ€˜summation curveโ€™ is a function written as a sum of terms, where the terms have coefficient parameters (parameters that scale the terms).

'summation curves':  f(x) = a_0 + a_1x + a_2x^2 + a_3x^3,   f(x) = A cos(3x) + B sin(3x)
not 'summation curves':  f(x) = e^{ax},   f(x) = x^2 + a,   f(x) = cos(ax) + sin(bx)


Summation curves can be fitted to given data using linear least squares. Each term's function part is evaluated at the input data, and these values are collected into the 'design matrix'. The coefficient parameters are collected into a parameter vector, so that the summation arises from the design matrix acting on this vector. The individual error terms form a vector called the error vector, and the output data is combined into an observation vector.

y_1 = a_0 + a_1 x_1 + a_2 x_1^2 + a_3 x_1^3 + \varepsilon_1
y_2 = a_0 + a_1 x_2 + a_2 x_2^2 + a_3 x_2^3 + \varepsilon_2
⋮
y_n = a_0 + a_1 x_n + a_2 x_n^2 + a_3 x_n^3 + \varepsilon_n

⟺ \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 & x_1^2 & x_1^3 \\ 1 & x_2 & x_2^2 & x_2^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 & x_n^3 \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} \qquad \mathbf{y} = X\mathbf{a} + \mathbf{e}

As you can see, the function part of each term, such as x^3 in a_3x^3, is first evaluated at the input data points, and the attached coefficient comes from the parameter vector. The variable of the problem is this parameter vector: we seek the vector 𝐚 that minimizes the total 'least squares' error quantity |𝐞| = |𝐲 − X𝐚|.

The procedure from here is just the same as in simple linear regression: given X and ๐ฒ, we first compute the normal equations X๐–ณX๐š=X๐–ณ๐ฒ, and then we row reduce and solve for ๐š.
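In code, building the design matrix and solving is mechanical. The following Python/NumPy sketch (the helper function is our own, and the data is made up) fits a cubic model to data that actually lies on a quadratic, so the recovered x^3 coefficient comes out zero:

```python
import numpy as np

def design_matrix(xs, terms):
    # one column per term function, evaluated at the input data
    return np.column_stack([[f(x) for x in xs] for f in terms])

# cubic model a0 + a1 x + a2 x^2 + a3 x^3
terms = [lambda x: 1.0, lambda x: x, lambda x: x**2, lambda x: x**3]
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2 + 3 * xs - xs**2            # data lying exactly on 2 + 3x - x^2
X = design_matrix(xs, terms)
a = np.linalg.lstsq(X, ys, rcond=None)[0]   # recovers (2, 3, -1, 0)
```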

Input vector vs. input matrix

Notice something about the linear regression. Errors in the input vectors are ignored: |𝐞|^2 is the total sum of squared errors in the output vector only.

Building further on that consideration, we can see that it does not matter whether the input is considered to be the vector (x_1, x_2, …, x_n)^𝖳 or instead the entire 'design matrix' X. Each term output, like x_2^3, on a given input vector is treated as part of the input data, all of which is stored in the matrix X. There is no propagation of errors through the term functions, like x^3, because errors in the input data are not accounted for at all.

Another way to put this: the LLS method does not distinguish between a0+a1x+a2x2+a3x3 treated as a function of the single variable x, and the same value treated as a function of the separate input numbers 1,x,x2,x3 for any given x number.

Example

Finding a best fit cubic

Suppose we have some data like this:

Input:  1  2  3  4  5  6
Output: 5  9  8  4  6  8

Let's fit a cubic polynomial f(x) = a_0 + a_1x + a_2x^2 + a_3x^3 to this data. We use the objects:

X = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ 1 & 4 & 16 & 64 \\ 1 & 5 & 25 & 125 \\ 1 & 6 & 36 & 216 \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} 5 \\ 9 \\ 8 \\ 4 \\ 6 \\ 8 \end{pmatrix}, \qquad \mathbf{a} = \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{pmatrix}.

First, compute:

X^{\mathsf{T}}X = \begin{pmatrix} 6 & 21 & 91 & 441 \\ 21 & 91 & 441 & 2275 \\ 91 & 441 & 2275 & 12201 \\ 441 & 2275 & 12201 & 67171 \end{pmatrix}, \qquad X^{\mathsf{T}}\mathbf{y} = \begin{pmatrix} 40 \\ 141 \\ 615 \\ 3027 \end{pmatrix}.

Row reduction of the augmented matrix (X^{\mathsf{T}}X \mid X^{\mathsf{T}}\mathbf{y}) yields:

(X^{\mathsf{T}}X \mid X^{\mathsf{T}}\mathbf{y}) \sim \left(\begin{array}{cccc|c} 2540160 & 0 & 0 & 0 & -13547520 \\ 0 & 2540160 & 0 & 0 & 38280480 \\ 0 & 0 & 2540160 & 0 & -12751200 \\ 0 & 0 & 0 & 2540160 & 1223040 \end{array}\right) \quad\Rightarrow\quad \mathbf{a} = \begin{pmatrix} -16/3 \\ 11393/756 \\ -1265/252 \\ 13/27 \end{pmatrix}

Therefore our best fit cubic polynomial is:

y = -\frac{16}{3} + \frac{11393}{756}x - \frac{1265}{252}x^2 + \frac{13}{27}x^3.
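A numerical check of this example in Python/NumPy:

```python
import numpy as np

xs = np.array([1.0, 2, 3, 4, 5, 6])
ys = np.array([5.0, 9, 8, 4, 6, 8])
X = np.vander(xs, 4, increasing=True)     # columns 1, x, x^2, x^3
a = np.linalg.solve(X.T @ X, X.T @ ys)    # solve the normal equations
# a should match (-16/3, 11393/756, -1265/252, 13/27)
```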

Linear after some transformation

Some functions appear not to be amenable to linear regression, but they can be transformed to become so.

Consider the class of functions f(x) = ae^{rx} with two parameters a and r. These functions are not 'summation curves'.

However, if we take the log of the observation vector, we can fit a linear function with parameters lnโกa and r:

lnโกy=lnโกa+rx.

LLS can be used to find the best fit values of ln a and r as if these were the y-intercept and slope of a line. Taking e^{ln a} = a recovers the desired a from the discovered ln a, and then a and r can be plugged back into f(x) = ae^{rx}.
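Here is a sketch of the transformation trick in Python/NumPy, with made-up data generated exactly from a = 2 and r = 0.5 so that the fit recovers these values:

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = 2.0 * np.exp(0.5 * xs)                  # data from f(x) = 2 e^{0.5 x}

# fit ln y = ln a + r x as a line via LLS
X = np.column_stack([np.ones_like(xs), xs])  # design matrix: intercept and slope
log_a, r = np.linalg.lstsq(X, np.log(ys), rcond=None)[0]
a = np.exp(log_a)                            # undo the log to recover a
```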

Multiple regression

It frequently happens that our model predicts observations on the basis of multiple input variables. This situation can be handled using LLS in the same way as โ€˜summation curvesโ€™ are handled. The entries in the design matrix include all the input values, and the parameter vector provides the candidate coefficients for the linear combinations of those input values.

f(u, v) = a + bu + cv, \qquad g(u, v) = a + bu + cv + du^2 + euv + fv^2.

The design matrix will contain the values of u and v (for f), or of u, v, u^2, uv, and v^2 (for g), that correspond to the various output values in the observation vector. The parameter vector will be (a, b, c)^𝖳 (for f) or (a, b, c, d, e, f)^𝖳 (for g).

The graph of f, given by setting z = f(x, y) (renaming the inputs u, v to x, y), is a plane passing through z = a at (x, y) = (0, 0). We can use LLS to find the plane of best fit for a collection of 3D vectors. For example, if we are given a set of points in 3D space:

(x_1, y_1, z_1), (x_2, y_2, z_2), …, (x_n, y_n, z_n),

then we can find the best fit plane by solving the following LLS problem:

z_1 = a + bx_1 + cy_1 + \varepsilon_1
z_2 = a + bx_2 + cy_2 + \varepsilon_2
⋮
z_n = a + bx_n + cy_n + \varepsilon_n

⟺ \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & y_n \end{pmatrix} \begin{pmatrix} a \\ b \\ c \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} \qquad \mathbf{z} = X\mathbf{a} + \mathbf{e}
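The plane-fitting setup looks like this in code (a Python/NumPy sketch, with made-up points chosen to lie exactly on the plane z = 1 + 2x − y):

```python
import numpy as np

pts = np.array([[0.0, 0.0, 1.0],    # points (x, y, z) on the plane z = 1 + 2x - y
                [1.0, 0.0, 3.0],
                [0.0, 1.0, 0.0],
                [1.0, 1.0, 2.0]])
X = np.column_stack([np.ones(len(pts)), pts[:, 0], pts[:, 1]])  # design matrix
z = pts[:, 2]                                                   # observation vector
a, b, c = np.linalg.lstsq(X, z, rcond=None)[0]   # recovers (1, 2, -1)
```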

Penrose quasi-inverse

The Penrose quasi-inverse, also called the Moore-Penrose pseudo-inverse, is a matrix construction that always exists, is easy to find using the SVD, and is directly related to the LLS problem.

Given a matrix A = UΣV^𝖳, its Penrose quasi-inverse is the matrix A^+ = VΣ^+U^𝖳. Here Σ^+ is derived from Σ by first transposing, and second inverting the nonzero entries on the main diagonal (i.e. taking their reciprocals), leaving the zero entries as zero.

The meaning of this matrix A^+ is that it acts like an inverse for vectors in the image of A; any other vector is first projected onto the image of A, and then this inversion is applied to the projection.

The connection to LLS is that if X and ๐ฒ are given, then X+๐ฒ is the least-squares solution vector ๐š.

Which LLS solution is it?

If there are many solutions for 𝐚, because X has a non-trivial null space, then, as you can see from the SVD formula, X^+𝐲 picks out the solution with zero component in the kernel and nonzero part only in the cokernel of X. (This is the part spanned by the vectors 𝐯_1, 𝐯_2, …, 𝐯_k corresponding to the nonzero singular values σ_1, …, σ_k.)

To summarize: LLS can have many solutions, but the single choice X^+𝐲 is the specific solution lying entirely in the cokernel of X. The other solutions differ from it by vectors lying in the kernel of X.
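A sketch of the construction in Python/NumPy (the function is our own; NumPy's built-in np.linalg.pinv computes the same thing):

```python
import numpy as np

def quasi_inverse(A, tol=1e-12):
    # A+ = V Sigma+ U^T, inverting only the nonzero singular values
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_plus = np.array([1.0 / x if x > tol else 0.0 for x in s])
    return Vt.T @ np.diag(s_plus) @ U.T

A = np.array([[1.0, 1.0], [2.0, 2.0], [0.0, 0.0]])   # rank 1, non-trivial null space
b = np.array([1.0, 3.0, 5.0])
x = quasi_inverse(A) @ b     # the minimum-norm least-squares solution
```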

Example

Using Penrose quasi-inverse to solve LLS

In Packet 13 we found the following SVD:

A = \begin{pmatrix} -3 & 1 \\ 6 & -2 \\ 6 & -2 \end{pmatrix} = \begin{pmatrix} 1/3 & 2/\sqrt{5} & 2/\sqrt{45} \\ -2/3 & 1/\sqrt{5} & -4/\sqrt{45} \\ -2/3 & 0 & 5/\sqrt{45} \end{pmatrix} \begin{pmatrix} 3\sqrt{10} & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} -3/\sqrt{10} & 1/\sqrt{10} \\ 1/\sqrt{10} & 3/\sqrt{10} \end{pmatrix} = U\Sigma V^{\mathsf{T}}.

Problem: Use the SVD above in order to (a) find the Penrose quasi-inverse A^+, and (b) find a vector 𝐱 that minimizes |A𝐱 − 𝐛| for 𝐛 = (2, 2, 4)^𝖳.

Solution: The quasi-inverse is formed by transposing Σ and inverting the nonzero entries. We obtain for (a):

\Sigma^+ = \begin{pmatrix} \frac{1}{3\sqrt{10}} & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad A^+ = V\Sigma^+U^{\mathsf{T}} = \begin{pmatrix} -3/\sqrt{10} & 1/\sqrt{10} \\ 1/\sqrt{10} & 3/\sqrt{10} \end{pmatrix} \begin{pmatrix} \frac{1}{3\sqrt{10}} & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} 1/3 & -2/3 & -2/3 \\ 2/\sqrt{5} & 1/\sqrt{5} & 0 \\ 2/\sqrt{45} & -4/\sqrt{45} & 5/\sqrt{45} \end{pmatrix} = \begin{pmatrix} -1/30 & 1/15 & 1/15 \\ 1/90 & -1/45 & -1/45 \end{pmatrix}.

Then for (b), to solve the LLS problem we simply compute A^+𝐛 and obtain 𝐱 = (1/3, −1/9)^𝖳.

Notice that after calculating A+ once, we can solve any LLS problem very efficiently. That is a nice feature.
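We can confirm these numbers with NumPy (reading the example matrix as the 3×2 matrix with rows (−3, 1), (6, −2), (6, −2)):

```python
import numpy as np

A = np.array([[-3.0, 1.0], [6.0, -2.0], [6.0, -2.0]])
b = np.array([2.0, 2.0, 4.0])
A_plus = np.linalg.pinv(A)    # Penrose quasi-inverse via the SVD
x = A_plus @ b                # should give (1/3, -1/9)
```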

Problems due 26 Apr 2024 by 11:59pm

Problem 14-01

Linear least-squares: non-orthogonal columns

Solve the linear least-squares problem for the following inconsistent system:

\begin{pmatrix} 1 & -2 \\ -1 & 2 \\ 0 & 3 \\ 2 & 5 \end{pmatrix} \mathbf{x} = \begin{pmatrix} 3 \\ 1 \\ -4 \\ 2 \end{pmatrix}.

Compute the error to two decimal places.

Problem 14-02

Linear least-squares: orthogonal columns

Solve the linear least-squares problem for the following inconsistent system, using the fact that the columns of A are orthogonal:

\begin{pmatrix} 4 & 0 & 1 \\ 1 & -5 & 1 \\ 6 & 1 & 0 \\ 1 & -1 & -5 \end{pmatrix} \mathbf{x} = \begin{pmatrix} 9 \\ 0 \\ 0 \\ 0 \end{pmatrix}.

Compute the error to two decimal places.

Problem 14-03

Fitting a plane to four points

Consider the four data points in 3D space:

(x,y,z)=(1,0,0),(0,1,1),(โˆ’1,0,3),(0,โˆ’1,4).

Find the parameters a,b,c for which the plane defined by z=a+bx+cy best fits these data points.

You may notice that the data points correspond to heights z = 0, 1, 3, 4 at the four corners of a square. Normally a plane is defined by 3 non-collinear points through which it passes. No plane passes through these four given points, but there is a plane that minimizes the sum of squares of the vertical errors.

Problem 14-04

LLS summation curve model

Consider the data points: (x,y)=(1,7.9),(2,5.4),(3,โˆ’0.9). Find the parameters of best fit for the model y=Acosโกx+Bsinโกx. Clearly identify your design matrix, observation vector, and parameter vector. Compute the error vector and the total error.

Problem 14-05

Penrose quasi-inverse

For this problem, use the definition of A+ based on the SVD of A.

  • (a) Verify that AA^+A = A and that A^+AA^+ = A^+ for the example A = \begin{pmatrix} -3 & 1 \\ 6 & -2 \\ 6 & -2 \end{pmatrix} studied in the section on the quasi-inverse. Think about how and why this happens in terms of the SVD.
  • (b) Show that for any A matrix, AA+๐ฏ is the projection of ๐ฏ onto the image of A.
  • (c) Show that for any A matrix, AA+A=A and A+AA+=A+.

It may be helpful (though not required) for this problem to use the presentation of SVD in the form A=ฯƒ1๐ฎ1๐ฏ1๐–ณ+ฯƒ2๐ฎ2๐ฏ2๐–ณ+โ‹ฏ as in Packet 13. In order to use this, figure out what the corresponding presentation of the SVD of A+ should be.

Problem 14-06

LLS uniqueness

  • (a) Show that: A๐ฑ=๐ŸŽ if and only if A๐–ณA๐ฑ=๐ŸŽ. (Hint: use the fact that ๐ฑ๐–ณA๐–ณA๐ฑ=|A๐ฑ|2.)
  • (b) Show that: A๐–ณA is invertible if and only if A has independent columns. (Warning: A need not be square. Hint: consider the dimensions of kernels, and use (a) as well as the rank-nullity theorem.)
  • (c) Explain why the LLS problem for A and ๐› has a unique solution if and only if the columns of A are independent. (Use (b).)
  • (d) Continuing from (b), show that: rank(A)=rank(A๐–ณA). (Hint: apply rank-nullity to both matrices. Notice that A and A๐–ณA have the same number of columns.)

You may observe that (a) implies dimโกKer(A)=dimโกKer(A๐–ณA), while (d) means dimโกIm(A)=dimโกIm(A๐–ณA). In other words, A and A๐–ณA always have the same size kernels and the same size images. This makes sense in terms of the SVD if you think about kernels, cokernels, and images generated by the orthonormal basis vectors ๐ฎiโ€™s and ๐ฏiโ€™s, writing A=ฯƒ1๐ฎ1๐ฏ1๐–ณ+ฯƒ2๐ฎ2๐ฏ2๐–ณ+โ‹ฏ as in the previous problem (keeping careful track of which ฯƒiโ€™s are zero or nonzero!).