Editing 2048: Curve-Fitting

==Explanation==
 
{{incomplete|Please edit the explanation below and only mention here why it isn't complete. Do NOT delete this tag too soon.}}
  
 
An illustration of several plots of the same data with {{w|Curve fitting|curves fitted}} to the points, paired with conclusions that you might draw about the person who made them. The data, when plotted on an X/Y graph, appear to have a general upward trend, but they are far too noisy, with too few data points, to clearly suggest any specific growth pattern. In such a case, many different mathematical and statistical models ''could'' be presented as roughly fitting the data, but none of them fits well enough to represent the data compellingly.
 
===Linear===
 
<math>f(x) = mx + b</math>
 
  
{{w|Linear regression}} is the most basic form of regression; it tries to find the straight line that best approximates the data. Because it is the simplest and most widely taught form of regression, and because differentiable functions are in general locally well approximated by a straight line, it is usually the first and most trivial fit to attempt.
  
The picture to the right shows how totally different data sets can result in the same line. More has to be known about the nature of the data to judge whether this simple line really makes sense.
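The point that different data sets can share one regression line can be checked numerically. The following is a minimal sketch in plain Python, not taken from the comic: the data values and the helper name <code>fit_line</code> are made up for illustration, and the fit uses the ordinary closed-form least-squares formulas.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]

# First data set: points lying exactly on y = 2x + 1 (illustrative values).
straight = [2.0 * x + 1.0 for x in xs]

# Second data set: the same points pushed around by offsets chosen so that
# sum(d) == 0 and sum(x * d) == 0, which leaves the least-squares line unchanged.
offsets = [3.0, -6.0, 3.0, 0.0, 0.0]
scattered = [y + d for y, d in zip(straight, offsets)]

line_straight = fit_line(xs, straight)
line_scattered = fit_line(xs, scattered)
```

Both calls return the same slope and intercept even though the second data set is visibly far from a straight line, which is why the fitted line alone says little about the data.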
  
 
The comment below the graph ''"Hey, I did a regression."'' refers to the fact that this is just the easiest way of fitting a curve to data.
 
===Quadratic===
 
{{w|Polynomial regression|Quadratic fit}} (i.e. fitting a parabola through the data) uses the lowest-degree polynomial that can fit a curved line to data; if the data exhibits clearly "curved" behavior (or if the experimenter feels that its growth should be more than linear), a parabola is often the first and easiest stab at fitting the data.
  
The comment below the graph ''"I wanted a curved line, so I made one with math."'' suggests that a quadratic regression is used when straight lines no longer satisfy the researcher, but they still want to use a simple mathematical expression. Quadratic fits like this are mathematically valid, and a parabola is one of the simplest kinds of curve in math, but this curve doesn't appear to fit the data any better than a simple linear regression does.
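One caveat when comparing the two fits: a least-squares parabola can never have a larger squared error than the least-squares line on the same data, because the line is a special case of the parabola. The sketch below illustrates this in plain Python. The helper names <code>polyfit</code> and <code>sse</code> and the data values are assumptions made for this example, and the fit solves the normal equations directly, which is fine for small, low-degree problems like this one.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest power first."""
    n = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]

def sse(xs, ys, coeffs):
    """Sum of squared errors of the polynomial on the data."""
    return sum(
        (y - sum(c * x ** k for k, c in enumerate(coeffs))) ** 2
        for x, y in zip(xs, ys)
    )

# Made-up noisy data with a rough upward trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 1.4, 3.9, 3.2, 5.8, 4.9, 6.1, 7.7]

sse_linear = sse(xs, ys, polyfit(xs, ys, 1))
sse_quadratic = sse(xs, ys, polyfit(xs, ys, 2))
```

Here <code>sse_quadratic</code> is never larger than <code>sse_linear</code>, so a slightly smaller error by itself is not evidence that the data is really curved.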
  
 
===Logarithmic===
 
 
[[File:Logarithm_plots.png|thumb|200px|Common logarithm functions.]]
 
<math>f(x) = a\log_b(x)</math>
  
A {{w|Logarithm|logarithmic}} curve grows more slowly at higher values, but still grows without bound rather than approaching a horizontal {{w|asymptote}}. The small ''b'' in the formula represents the base, which is in most cases ''{{w|e (mathematical constant)|e}}'', 10, or 2. If the data is presumed to approach a horizontal asymptote, this fit is not an effective way to explain the nature of the data.
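The difference between "growing more slowly" and "levelling off" can be made concrete. The sketch below assumes the form <code>a * log_b(x)</code> with arbitrary illustrative values for ''a'' and ''b''; the function names are hypothetical and exist only for this example.

```python
import math

def log_curve(x, a=2.0, b=10.0):
    """f(x) = a * log_b(x); a and b are arbitrary illustrative values."""
    return a * math.log(x) / math.log(b)

def x_needed(target, a=2.0, b=10.0):
    """The x at which the curve reaches `target` (inverts the formula)."""
    return b ** (target / a)

# Each tenfold increase in x adds the same amount (a), so growth never stops...
steps = [log_curve(10.0 ** (k + 1)) - log_curve(10.0 ** k) for k in range(1, 6)]

# ...but the growth per unit of x keeps shrinking.
slopes = [log_curve(x + 1.0) - log_curve(x) for x in (10.0, 100.0, 1000.0)]
```

Whatever bound is proposed, <code>x_needed</code> gives a point where the curve reaches it, which is what "no horizontal asymptote" means in practice.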
  
 
The comment below the graph ''"Look, it's tapering off!"'' gives the impression that the growth is levelling off, whereas under this fit the curve still grows to infinity, only much more slowly than under a linear regression.
 
===Exponential===
 
 
[[File:Exponential.svg|thumb|200px|Exponential growth (green) compared to other functions.]]
 
<math>f(x) = a\cdot b^x</math>
  
An {{w|Exponential growth|exponential curve}}, by contrast, is typical of a phenomenon whose growth keeps accelerating. A common case is a process that generates stuff that contributes to the process itself; think bacterial growth or compound interest.
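A common way to fit such a curve, sketched here under the assumption that the model is <code>a * b**x</code> with no added offset and that all ''y''-values are positive, is to fit a straight line to the logarithm of the data. The compound-interest figures and the function names are made up for the example. Note that this minimizes the error in log space, which is not quite the same as minimizing it on the original scale.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x

def fit_exponential(xs, ys):
    """Fit y = a * b**x via ln(y) = ln(a) + x * ln(b); returns (a, b)."""
    slope, intercept = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(intercept), math.exp(slope)

# Hypothetical savings account: 100 units growing by 5% per year.
years = list(range(10))
balance = [100.0 * 1.05 ** t for t in years]

a, b = fit_exponential(years, balance)
```

On this noise-free data the fit recovers the starting balance as ''a'' and the yearly growth factor as ''b''.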
  
 
The logarithmic and exponential interpretations could very easily be fudged or engineered by a researcher with an agenda (such as by taking a misleading subset or even outright lying about the regression), which the comic mocks by juxtaposing them side-by-side on the same set of data.
 
  
 
===LOESS===
 
A {{w|Local regression|LOESS fit}} doesn't use a single formula to fit all the data, but approximates data points locally using different polynomials for each "zone" (giving data points less weight the further they are from it) and patching them together. As it has many more degrees of freedom than a single polynomial, it generally "fits better" to any data set, although it is generally impossible to derive any strong, "clean" mathematical relationship from it; it is just a nice smooth line that approximates the data points well, with a good degree of outlier rejection.
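The local weighting idea can be sketched in a few lines. This is a simplified illustration rather than a full LOESS implementation: it fits a weighted straight line around each query point with tricube weights, assumes at least three distinct ''x''-values, and omits the robustness iterations that real implementations use to reject outliers. The names <code>tricube</code>, <code>loess_point</code> and <code>frac</code> are choices made for this example.

```python
def tricube(u):
    """Tricube weight: 1 at the centre, falling smoothly to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def loess_point(x0, xs, ys, frac=0.5):
    """Smoothed value at x0 from a weighted line fit to its nearest neighbours."""
    # Number of neighbours that define the local "zone" (at least 3).
    k = min(len(xs), max(3, round(frac * len(xs))))
    bandwidth = sorted(abs(x - x0) for x in xs)[k - 1]
    ws = [tricube((x - x0) / bandwidth) for x in xs]
    sw = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / sw
    mean_y = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (x0 - mean_x)

# Made-up noisy data; the smoothed curve is one local fit per x-value.
xs = [float(i) for i in range(10)]
ys = [1.2, 0.8, 2.9, 2.1, 4.4, 3.6, 5.9, 5.1, 7.3, 6.8]
smoothed = [loess_point(x, xs, ys) for x in xs]
```

The result is a list of smoothed values, one per point, with no single formula behind them, which matches the description above.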
  
The comment below the graph ''"I'm sophisticated, not like those bumbling polynomial people."'' plays up this more complicated technique, but without a simple mathematical description the fit is not much help in finding an informative interpretation of the underlying data.
  
 
===Linear, No Slope===
 
 
<math>f(x) = c</math>
 
 
This is also known as a constant function, since the function takes on the same (constant) value ''c'' for all values of ''x''. The least-squares value of ''c'' is simply the average of the ''y''-values in the data.
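That the average is the best constant in the least-squares sense can be checked directly. The data values and function names in this short sketch are made up for illustration.

```python
def fit_constant(ys):
    """Least-squares constant fit: the mean of the y-values."""
    return sum(ys) / len(ys)

def sse(ys, c):
    """Sum of squared errors of the horizontal line y = c."""
    return sum((y - c) ** 2 for y in ys)

ys = [2.0, 3.0, 5.0, 10.0]
c = fit_constant(ys)

# Nudging the line up or down from the mean only increases the error.
errors_nearby = [sse(ys, c + d) for d in (-1.0, -0.1, 0.1, 1.0)]
```

Every entry in <code>errors_nearby</code> is larger than <code>sse(ys, c)</code>, whatever the ''x''-values are, because they play no part in the fit.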
 
  
 
Apparently, the person making this line figured out pretty early on that their data analysis was turning into a scatter plot, and wanted to escape their personal stigma of scatter plots by drawing an obviously false regression line on top of it. Alternatively, they were hoping the data would be flat, and are trying to pretend that there's no real trend to the data by drawing a horizontal trend line.
 
  
The comment below the graph ''"I'm making a scatter plot but I don't want to."'' is probably made by a student who isn't happy with their choice of field of study.
  
 
===Logistic===
 
===Piecewise===
 
A classic example in physics is the set of competing theories that tried to explain black-body radiation at the end of the 19th century. The {{w|Wien approximation}} was good for small wavelengths while the {{w|Rayleigh–Jeans law}} worked for larger ones (large wavelength means low frequency and thus low energy). But there was a gap in the middle, which was filled by {{w|Planck's law}} in 1900.
  
The comment below the graph ''"I have a theory, and this is the only data I could find."'' is a bit ambiguous because many data points are ignored. Without an explanation of why only a subset of the data is used, this isn't a useful interpretation at all. In fact, given the extra degrees of freedom offered by piecewise regression, it could indicate that the researcher is trying to fit the data to confirm their theory, rather than building the theory from the data.
  
 
===Connecting lines===
 
 
This is often used to bridge gaps between measurements. A simple example is air temperature, which is typically measured at discrete intervals. When the intervals are short enough, it is safe to assume that the temperature did not change much between measurements, so connecting the data points with lines does not distort the real situation in many cases.
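Connecting the points with straight segments amounts to linear interpolation. The following minimal sketch assumes the ''x''-values are sorted and distinct; the temperature readings and the function name <code>interpolate</code> are invented for the example.

```python
from bisect import bisect_right

def interpolate(xs, ys, x):
    """Value at x on the straight segments joining consecutive data points."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the measured range")
    # Index of the right-hand end of the segment that contains x.
    i = min(bisect_right(xs, x), len(xs) - 1)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    t = (x - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)

# Hypothetical temperature readings taken every six hours.
hours = [0.0, 6.0, 12.0, 18.0]
temps = [10.0, 8.0, 16.0, 14.0]

estimate_at_9 = interpolate(hours, temps, 9.0)
```

The connected line passes through every measurement exactly, so it reproduces the noise in the data as faithfully as the signal; it fills gaps but does not summarize a trend.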
  
The comment below the graph ''"I clicked 'Smooth Lines' in {{w|Microsoft Excel|Excel}}."'' refers to the well-known spreadsheet application from {{w|Microsoft Office}}. Like other spreadsheet applications, it can visualize data from a table as a graph in many ways. "Smooth Lines" is a setting meant for use on a {{w|line graph}}, a graph in which one axis represents time. Because it simply joins up every point, using Bézier (or similar) curves as necessary to pass through each one, rather than finding a more sensible line that accepts some small but non-zero level of error in the data points, it is not suitable for regression.
  
 
===Ad-Hoc Filter===
 
===House of Cards===
  
 
The comment below the graph ''"As you can see, this model smoothly fits the- wait no no don't extend it AAAAAA!!"'' refers to a curve which fits the data points relatively well within the graph's boundaries, but beyond those bounds fails to match at all.
 
The name is also a reference to the TV show ''{{w|House of Cards (U.S. TV series)|House of Cards}}'' ("WAIT NO, NO, DON'T EXTEND IT!").
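A curve that fits well inside the plotted range but falls apart beyond it is easy to produce. As one possible illustration (an assumption for this example, not necessarily the model drawn in the comic), the sketch below passes a single polynomial exactly through six made-up points that roughly follow ''y'' = ''x'' + 1 and then evaluates it outside the data. The function name <code>lagrange</code> refers to the standard Lagrange interpolation formula.

```python
def lagrange(xs, ys, x):
    """Evaluate the unique polynomial through all (xs, ys) points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Made-up data that roughly follows y = x + 1 with small wiggles.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.2, 2.9, 4.1, 4.8, 6.2]

inside = lagrange(xs, ys, 3.0)    # matches the data point at x = 3
outside = lagrange(xs, ys, 10.0)  # extended well past the last data point
```

Within the data the curve hits every point, but at ''x'' = 10 it lands far from the value of about 11 that the overall trend would suggest.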
  
 
===Cauchy-Lorentz (title text)===
 
 
Since so many different models can fit this data set at first glance, Randall may be making a point about how if a data set is sufficiently messy, you can read any trend you want into it, and the trend that is chosen may say more about the researcher than about the data. This is a similar sentiment to [[1725: Linear Regression]], which also pokes fun at dubious trend lines on scatterplots.
 
  
Augustin-Louis Cauchy originally worked as a junior engineer in a managerial position. Upon his acceptance to the Académie des Sciences in March 1816, many of his peers expressed outrage. Despite his early work in "mere" engineering, Cauchy is widely regarded as one of the founding influences in the rigorous study of calculus and its accompanying proofs. Notably, his later work included theoretical physics, and Lorentz was also a well-known physicist. Therefore, the title text may be referring back to [[793: Physicists]].
  
 
Alternatively, the title text could be implying that the person who applied the Cauchy-Lorentz curve-fitting method may not be well qualified for the task assigned.
