Perpendicular Distance Least-Squares Fitting for Arbitrary Lines
(updated )
The mathematician does not study pure mathematics because it is useful; he studies it because he delights in it and he delights in it because it is beautiful.
Henri Poincare
I’m not a mathematician, but I understand the mathematician’s delight. When terms cancel, substitutions simplify, and something that was previously hidden is revealed and exposed to the mind for the first time, the intellect is entertained.
I can still recall one of my earlier moments of “mathematician’s delight.” I was sitting in Mr. Pontifex’s grade 12 physics class as he derived the formulas for obtaining the best-fit line for a set of points. We had already used the equations often to find the answer our lab experiments were designed to produce, but the equations were just prescribed to us with directions how to use them. To finally see how the equations were derived, with the help of the calculus we had just learned, was far more interesting to me than the physics we were using it for.
For reference, here are the equations that find the line, \(y = mx + b\), that “best” fits a set of points.
The standard derivation for this can be found on MathWorld.
These equations, though mathematically correct, are often misapplied. Implicit in the equations is a definition of “best” that isn’t universal. The error being minimized is the sum of the squares of the vertical distances between the data points and the fitted line. This is often best when your data points correspond to experimental observations that measure \(y\) for some independent variable \(X\). But if both the coordinates of your data points are independent, the classic formulation is not only inappropriate, but also mathematically undefined if the best line is vertical since such a line cannot be expressed as \(y = mx + b\).
Least-squares fitting using vertical offsets
What is often the better solution is to find the line that minimizes the perpendicular distances between the data points and the line:
Least-squares fitting using perpendicular offsets.
A solution that uses perpendicular distances can also be found on MathWorld. In that solution, the author notes the derivation takes “a fair bit of algebra” and produces an “unwieldy form of the best-fit parameters”. Furthermore, it still uses \(y = mx + b\) as its basis, so it fails to address the case when the best line is vertical.
While developing an early version of Music Lens I was guilty of misapplying the traditional least-squares line-fitting equations. Music Lens is a program that employs computer vision algorithms to “read” sheet music. Music notation is dominated by a lot of lines (staff lines, barlines✻, stems, beams, etc.) so line-finding is naturally a major component. One of the intermediate results inherent in my line-finding algorithm was knowing whether the line was more horizontal or vertical. This allowed me to swap the \(x\) and \(y\) variables when the line was “more vertical,” thus avoiding the degenerate case of a vertical line. With this hack, the classic least-squares fitting algorithm worked in practice, but I recognized it was inelegant and mathematically questionable. More than two decades after implementing this hack, I finally found a better solution in the chapter “The Best Least-Squares Line Fit” from the book Graphics Gem V (GGV).
The solution in GGV uses perpendicular distances and works for lines at any orientation.
There were several factors that compelled me to write this up:
- 
      This solution is not well known and deserves broader dissemination. The only description I was able to find was in GGV, and this book, first published in 1995, is not easy to find in local bookstores and is expensive on Amazon (or here for my fellow Canadians). 
- 
      My derivation differs slightly than the one in GGV because I use a different line parameterization. Both GGV and my derivation use a \(\rho\)-\(\theta\) parameterization, but our definitions of \(\theta\) differ. Those familiar with the Hough Transform will find the same line parameterization used below. 
- 
      My derivation is not quite complete, and I’m hoping someone reading can fill in the loose ends. 
Without further ado, here is the derivation for the best least-squares line fit using perpendicular distances. All credit for this goes to GGV; the derivation below is based entirely on this work. The only difference is the slightly different line parameterization used. I also show more steps in the derivation, as I found some of the steps in GGV large and non-obvious; it took me many hours to figure out some of the intermediate steps.
In case any readers have access to GGV, I want to point out a newsgroup posting that claims there are errors in the GGV derivation and source code.
Derivation
The vital aspect to this derivation is using the relatively uncommon \(\rho\)-\(\theta\) parameterization of a line, also known as the Hessian normal form. The equation of a line is given as:
The geometric interpretation of \(\rho\) and \(\theta\) are shown in Figure 3.
The \(\rho\), \(\theta\) representation of a line.
The distance between a point \(\mathbf{p} = (x, y)\) and the line \(L_1\) with \((\rho_1, \theta_1)\) can be found by constructing a new line \(L_2\) that is parallel to \(L_1\) and passes through \(\mathbf{p}\). Because \(L_2\) is parallel to \(L_1\), it follows that \(\theta_2 = \theta_1\), and \(\rho_2 = x \cos \theta_1 + y \sin \theta_1\). The distance between \(\mathbf{p}\) and \(L_1\) is then just the difference between \(\rho_1\) and \(\rho_2\):
To find the best fit line to a set of points we need to find the \((\rho, \theta)\) that minimizes the sum of \(D(\mathbf{p}_i, L)^2\).
To minimize equation \ref{eq:error}, the partial derivatives of \(E\) must be zero:
Let’s examine \(\partial E/\partial \rho\) first:
Since (\ref{eq:centroid}) is in the same form as (\ref{eq:line}), the point \((\overline{x}, \overline {y})\), which is the centroid of the points, passes through the best fit line. If we translate all the points by \((-\overline{x}, -\overline{y})\) and find the best fit line to the translated points, it’s intuitively clear from the geometry that this line has the same \(\theta\) as the best fit line to the original points.
To avoid introducing extra notation, I’m going to skip the formalism of introducing new variables to represent the translated points and just pretend that the centroid of the original points is at the origin. So, \(\overline{x} = \overline{y} = 0\). Consequently, \(\sum_{i=i}^{n} x_i = \sum_{i=i}^{n} y_i = 0\). From equation (\ref{eq:centroid}) it follows that \(\rho = 0\). This greatly simplifies the solution for finding \(\theta\) as any terms with \(\sum_{i=i}^{n} x_i\), \(\sum_{i=i}^{n} y_i\), or \(\rho\) can be dropped. (Once we’ve found the solution for \(\theta\), we’ll disregard this convenient masquerade and have \(\overline{x}\) and \(\overline{y}\) take on their real values to determine \(\rho\) for the original, untranslated points.)
Let’s looks at \(\partial E/\partial \theta\):
where
Proceed by dividing (\ref{eq:theta}) by \(\cos^2 \theta\):
Using the quadratic formula, solving the quadratic (\ref{eq:theta2}) yields the following solutions:
For those only interested in how to compute the best fit line, we’re done. Of the two solutions, \(\theta_1\) is the correct value for \(\theta\), and \(\rho\) is given by equation (\ref{eq:centroid}). So, we have:
Recall that earlier we pretended the original points were translated by \((-\overline{x}, -\overline{y})\) so we could assume \(\overline{x} = \overline{y} = 0\) and take advantage of some convenient simplifications. In equation (\ref{eq:rho}), we stop pretending, so \(\overline{x}\) and \(\overline{y}\) describe the centroid of the original points.
When implementing in software, the case \(b = 0\) is not problematic by using the atan2(y, x) function.
A Swift implementation of this is available as a GitHub gist.
To complete this derivation, we need to show that \(\theta_1\) is the solution for the minimum and not the maximum (which would be the worst fitting line). To prove that \(\theta_1\) is the minimum, the second derivative test can be used: it must be shown that the second derivative of \(E\) at \(\theta_1\) is positive.
If we take the second partial derivative of \(E\) with respect to \(\theta\) and substitute \(\rho = 0\) and the summation terms with equations (\ref{eq:a}) and (\ref{eq:b}), we get:
It takes some algebra to get the above, but it’s straight forward. The next steps are less obvious (to me, anyway). From equation (\ref{eq:quadratic}) we can express the solution for \(\theta_1\) (I’ll drop the subscript from here on) as:
If we introduce the following substitutions:
we can rewrite (\ref{eq:ratio}) as:
which we can rewrite as two equations by introducing a constant \(t\):
Now we rewrite the partial derivative (\ref{eq:partial}), replacing instances of \(a\), \(b\), \(\cos \theta\), and \(\sin \theta\) with their values expressed in terms of \(\alpha\), \(\beta\), \(\gamma\), and \(t\):
We’ll eliminate \(\beta^2\) by noting that \(\beta^2 = \gamma^2 - \alpha^2\). This gives us:
From the definition of \(\gamma\) it follows that \(\gamma \gt 0\) and that \(\gamma \gt \alpha\) (if \(b \ne 0\)). This implies that \(\alpha - \gamma \lt 0\). Also \(t^2\) and \(\gamma^2\) are clearly both positive. Since there are two negative factors and two positive factors in (\ref{eq:partial2}), it must be that \(\partial^2E/\partial \theta^2 > 0\). Therefore, \(E\) has a minimum at \(\theta_1\).
We could follow an almost identical sequence of steps to show that the other solution to the quadratic (\ref{eq:quadratic}), \(\theta_2\), is the maximum of \(E\).
There are a few loose ends that I’ve failed to cover. To get (\ref{eq:theta2}) I’ve assumed that that \(\cos \theta \ne 0\), and thus safe to divide by. In deducing that equation (\ref{eq:partial2}) is positive, it was assumed that \(b \ne 0\). I’ve done several tests of my Swift implementation with points that exercise the case \(b = 0\), and the solution seems to work in all cases. But a proof that \(\theta_1\) is the minimum when \(b = 0\) eludes me. I’ve tried the extreme test, but I’ve been unable to show that a requisite higher order derivative is positive. If any reader can fill in the missing bits, I’d love to hear from you.
Update (May 2021)
Since this article was published I’ve received some valuable feedback from a few readers that is worth sharing†.
Alternate Solutions
Pablo Lessa sent me a very succinct solution using complex numbers. Points are represent by numbers in the complex plane; i.e. the point \((x_i, y_i)\) is presented by a \(z_i = x_i+y_{i}i\). The solution is then obtained as follows:
Let \(p_1 = \mu-\sigma\) and \(p_2 = \mu+\sigma\). The line passing through \(p_1\) and \(p_2\) is the solution. There’s an edge case when \(\sigma^2 = 0\) that requires special handling; I’ll ignore it here, but interested readers should see the Wikipedia article Deming regression which describes this algorithm.
Incidentally, that Wikipedia article mentions the article Two geometrical applications of the mathematics of least squares, written in 1913, that uses the same line parameterization that I use here, and also starts its derivation with equation (\ref{eq:error}), but takes a different algebraic path.
Floating-point Issues
Alan Nunns provided two tips to improve the floating-point numeric stability of my Swift implementation (which have now been incorporated).
The first improvement is in how the centroid is computed. Each coordinate of the centroid is just an arithmetic mean. The naïve implementation sums all the \(x_i\)’s (and \(y_i\)’s) and then divides the sum by \(n\). The flaw with this approach is when you have a lot of samples, the summation can produce a very large number. Adding a very large number to a small number using floating-point arithmetic has a large error (in the worst case, \(S \oplus x = S\), where \(S \gg x\) and \(\oplus\) is floating-point addition). The article Numerically stable computation of arithmetic means provides a nice writeup of how means should be computed.
The second improvement is how \(y_i^2 - x_i^2\) is computed. The naïve implementation is susceptible to catastrophic cancellation, and it is better to evaluate \((x - y)(x + y)\). The excellent and oft-cited article What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg has this exact example (plus a lot of theory and many more examples).
While reading Goldberg’s article, I became aware of another potential floating-point issue with my original Swift implementation. In (\ref{eq:finaltheta}), the numerator expression \(a-\sqrt{a^2+4b^2}\) is susceptible to catastrophic cancellation when \(a^2 \gg b^2\) and \(a > 0\). In this case, \(\sqrt{a^2+4b^2} \approx |a|\), resulting in the subtraction of two nearly equal values, which can lead to catastrophic cancellation. To avoid this, we can multiply the numerator and denominator of the argument of the arc tangent function by \(a+\sqrt{a^2+4b^2}\):
Finally, the expression \(\sqrt{a^2+4b^2}\) should be evaluated with the hypot function, if available. hypot(x, y) computes the value of \(\sqrt{x^2+y^2}\) in a way that limits underflow or overflow errors. For example, in Swift, instead of using sqrt(a*a+4*b*b), use hypot(a, 2*b).
References
[1] Alciatore, David, and Rick Miranda. 1995. “The Best Least-Squares Line Fit,” in Graphics Gems V, 91–97. Academic Press, Inc.
[2] Wikipedia contributors, “Deming regression,” Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Deming_regression&oldid=993424551 (accessed May 5, 2021).
[3] Diego Assencio, “Numerically stable computation of arithmetic means,” diego.assencio.com, https://diego.assencio.com/?index=c34d06f4f4de2375658ed41f70177d59 (accessed May 5, 2021).
[4] David Goldberg. 1991. What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. 23, 1 (March 1991), 5–48.
Most dictionaries spell this “bar lines”, but if Behind Bars: The Definitive Guide to Music Notation (by Elain Gould) spells it “barlines” then so do I. ↩︎
I’m very late providing this update, as the feedback was received years ago. ↩︎