Editing 2048: Curve-Fitting
Latest revision | Your text | ||
Line 8: | Line 8: | ||
==Explanation== | ==Explanation== | ||
+ | {{incomplete|Please edit the explanation below and only mention here why it isn't complete. Do NOT delete this tag too soon.}} | ||
− | An illustration of several plots of the same data with | + | An illustration of several plots of the same data with curves fitted to the points, paired with conclusions that you might draw about the person who made them. |
− | When modeling | + | When modeling a phenomenon statistically, it is common to search for trends, and fitted curves can help reveal these trends. Much of the work of a data scientist or statistician is knowing which fitting method to use for the data in question. Here we see various hypothetical scientists or statisticians each applying their own interpretations, and the comic mocks each of them for their various personal biases or other assorted excuses. |
− | + | In general, the researcher will specify the form of an equation for the line to be drawn, and an algorithm will produce the actual line. | |
− | + | This comic is similar to [[977: Map Projections]] which also uses a scientific method not commonly thought about by the general public to determine specific characteristics of one's personality and approach to science. | |
+ | |||
+ | Regressions have been the subject of several previous comics. [[1725: Linear Regression]] was about linear regressions on uncorrelated or poorly correlated data. [[1007: Sustainable]] and [[1204: Detail]] depict linear regressions on data that was actually logistic, leading to bizarre extrapolations. [[605: Extrapolating]] shows a line extrapolating from just two data points. | ||
===Linear=== | ===Linear=== | ||
− | + | <math>f(x) = mx + b</math> <p>Linear regression is the most basic form of regression; it tries to find the straight line that best approximates the data.</p><p>Because it is the simplest and most widely taught form of regression, and because differentiable functions are in general well approximated locally by a straight line, it is usually the first and most trivial fit to attempt.</p> |
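As a rough illustration of what "finding the straight line that best approximates the data" means, the following Python sketch computes the ordinary least-squares values of <math>m</math> and <math>b</math> in closed form. The function name <code>fit_line</code> and the sample data are made up for this example; the comic does not say which procedure was used to draw its line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of f(x) = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

# Hypothetical data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
m, b = fit_line(xs, ys)
```

On real, scattered data the same formulas still return a line, whether or not a line is a sensible model.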
− | <math>f(x) = mx + b</math> | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
===Quadratic=== | ===Quadratic=== | ||
− | <math>f(x) = ax^2 + bx + c</math> | + | <math>f(x) = ax^2 + bx + c</math> <p>A quadratic fit (i.e. fitting a parabola through the data) uses the lowest-degree polynomial that can fit data with a curved line; if the data exhibits clearly "curved" behavior (or if the experimenter feels that its growth should be more than linear), a parabola is often the first stab at fitting the data.</p> |
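A parabola can be fitted by least squares in the same spirit as a straight line. The sketch below builds the three normal equations for <math>a</math>, <math>b</math> and <math>c</math> and solves them with a small elimination routine. The helper names <code>solve</code> and <code>fit_quadratic</code> and the data are assumptions made for illustration only.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_quadratic(xs, ys):
    """Least-squares fit of f(x) = a*x^2 + b*x + c; returns (a, b, c)."""
    s = [sum(x ** k for x in xs) for k in range(5)]          # sums of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    normal = [[s[4], s[3], s[2]],
              [s[3], s[2], s[1]],
              [s[2], s[1], s[0]]]
    a, b, c = solve(normal, [t[2], t[1], t[0]])
    return a, b, c

# Hypothetical data lying exactly on y = x^2 - 3x + 2
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x - 3 * x + 2 for x in xs]
a, b, c = fit_quadratic(xs, ys)
```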
− | |||
− | |||
− | |||
− | |||
− | |||
===Logarithmic=== | ===Logarithmic=== | ||
− | + | <math>f(x) = a \log_b(x) + c</math> <p>A logarithmic curve is typical of a phenomenon whose growth gets slower and slower as time passes (indeed, its derivative, i.e. its growth rate, is <math>\propto \frac{1}{x} \rightarrow 0</math> for <math>x \rightarrow +\infty</math>), but which still grows without bound rather than approaching a horizontal asymptote. (If it did approach a horizontal asymptote, then one of the other models subtracted from a constant would probably be better, e.g. <math>f(x) = a - \frac{b}{x}</math> or <math>f(x) = a - b^{-cx}</math>.) An experimenter who wants to confirm that growth is tapering off may try to fit a logarithmic curve.</p> |
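Because <math>a \log_b(x) = \frac{a}{\ln b} \ln x</math>, the base can be absorbed into the coefficient, and the model becomes a straight line in <math>\ln x</math>. The sketch below uses that observation to fit the curve with the natural logarithm. The name <code>fit_log</code> and the data are hypothetical, and all <math>x</math> values are assumed to be positive.

```python
import math

def fit_log(xs, ys):
    """Least-squares fit of f(x) = a*ln(x) + c; returns (a, c). Requires x > 0."""
    us = [math.log(x) for x in xs]   # the model is linear in u = ln(x)
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    a = (sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
         / sum((u - mean_u) ** 2 for u in us))
    c = mean_y - a * mean_u
    return a, c

# Hypothetical data lying exactly on y = 3*ln(x) + 0.5
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [3.0 * math.log(x) + 0.5 for x in xs]
a, c = fit_log(xs, ys)
```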
− | <math>f(x) = a\log_b(x)</math> | ||
− | |||
− | A {{ | ||
− | |||
− | |||
===Exponential=== | ===Exponential=== | ||
− | + | <math>f(x) = a b^x + c</math> <p>An exponential curve, by contrast, is typical of a phenomenon whose growth gets faster and faster. A common case is a process whose output feeds back into the process itself, such as bacterial growth or compound interest.</p> |
− | <math>f(x) = a | + | *The logarithmic and exponential interpretations could very easily be fudged or engineered by a researcher with an agenda (such as by taking a misleading subset or even outright lying about the regression), which the comic mocks by juxtaposing them side-by-side on the same set of data. |
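One common shortcut for fitting an exponential is to take logarithms of the data, which turns <math>a b^x</math> into a straight line in <math>x</math>. The sketch below does this under two simplifying assumptions that are not in the comic: the offset <math>c</math> is taken to be 0, and every <math>y</math> value is positive. The name <code>fit_exponential</code> and the data are made up for the example.

```python
import math

def fit_exponential(xs, ys):
    """Fit f(x) = a * b**x (c assumed to be 0) via a line fit to ln(y)."""
    logs = [math.log(y) for y in ys]   # requires every y > 0
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    slope = (sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_l - slope * mean_x
    # ln(y) = ln(a) + x*ln(b), so undo the logarithm on both coefficients
    return math.exp(intercept), math.exp(slope)

# Hypothetical data lying exactly on y = 2 * 1.5**x
xs = [0, 1, 2, 3, 4]
ys = [2.0 * 1.5 ** x for x in xs]
a, b = fit_exponential(xs, ys)
```

Note that fitting in log space minimizes relative rather than absolute errors, so on noisy data it will generally not give the same curve as a direct least-squares fit.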
− | |||
− | An | ||
− | |||
− | The logarithmic and exponential interpretations could very easily be fudged or engineered by a researcher with an agenda (such as by taking a misleading subset or even outright lying about the regression), which the comic mocks by juxtaposing them side-by-side on the same set of data. | ||
− | |||
− | |||
− | |||
===LOESS=== | ===LOESS=== | ||
− | A {{w|Local regression|LOESS fit}} doesn't use a single formula to fit all the data, but approximates data points locally using different polynomials for each "zone" (weighting data points | + | <math>w(d) = (1-|d|^3)^3</math> (note: this is only the function used for the weights, where <math>d</math> is the scaled distance from the point being estimated; it is not the formula of the fitted curve, which is assembled piecewise from local polynomials) <p>A {{w|Local regression|LOESS fit}} doesn't use a single formula to fit all the data, but approximates data points locally using a different polynomial for each "zone" (giving data points less weight the further they are from it) and patching the results together.</p><p>As it has many more degrees of freedom than a single polynomial, it generally "fits better" to any data set, although it is generally impossible to derive any strong, "clean" mathematical relationship from it. It is just a nice smooth line that approximates the data points well, with a good degree of resistance to outliers.</p> |
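To make the "local fit with weights" idea concrete, here is a minimal sketch of one LOESS-style estimate: at a chosen point, nearby data get tricube weights and a weighted straight line is fitted through them. It is a simplification, not a full LOESS implementation: it uses local lines only, has no robustness iterations, and assumes distinct <math>x</math> values. The names <code>tricube</code>, <code>loess_at</code> and the <code>frac</code> parameter are chosen for this example.

```python
def tricube(d):
    """Tricube weight w(d) = (1 - |d|^3)^3 for |d| < 1, and 0 otherwise."""
    d = abs(d)
    return (1.0 - d ** 3) ** 3 if d < 1.0 else 0.0

def loess_at(x0, xs, ys, frac=0.5):
    """Smoothed value at x0 from a tricube-weighted straight-line fit."""
    k = min(len(xs), max(3, round(frac * len(xs))))  # neighbours in the window
    h = sorted(abs(x - x0) for x in xs)[k - 1]       # window half-width
    ws = [tricube((x - x0) / h) for x in xs]
    sw = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / sw
    mean_y = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    if sxx == 0.0:
        return mean_y  # only one point carries weight; no slope can be fitted
    slope = sum(w * (x - mean_x) * (y - mean_y)
                for w, x, y in zip(ws, xs, ys)) / sxx
    return mean_y + slope * (x0 - mean_x)

# Hypothetical data lying exactly on y = 2x + 1
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
smoothed = loess_at(2.5, xs, ys)
```

Evaluating <code>loess_at</code> on a grid of points and joining the results gives the smooth curve; on exactly linear data, as here, it simply reproduces the line.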
− | |||
− | |||
===Linear, No Slope=== | ===Linear, No Slope=== | ||
− | <math>f(x) = c</math> | + | <math>f(x) = c</math> <p>Apparently, the person making this line figured out pretty early on that their data analysis was turning into a scatter plot, and wanted to escape their personal stigma of scatter plots by drawing an obviously false regression line on top of it. Alternatively, they were hoping the data would be flat, and are trying to pretend that there's no real trend to the data by drawing a horizontal trend line.</p> |
− | |||
− | |||
− | |||
− | Apparently, the person making this line figured out pretty early on that their data analysis was turning into a scatter plot, and wanted to escape their personal stigma of scatter plots by drawing an obviously false regression line on top of it. Alternatively, they were hoping the data would be flat, and are trying to pretend that there's no real trend to the data by drawing a horizontal trend line. | ||
− | |||
− | |||
− | |||
===Logistic=== | ===Logistic=== | ||
− | + | <math>f(x) = L / (1 + e^{-k(x-b)})</math> <p>A logistic curve provides a smooth, S-shaped transition curve between two flat intervals; indeed the caption says that the experimenter just wants to find a mathematically-respectable way to link two flat lines.</p> | |
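The "two flat lines joined smoothly" behavior can be checked directly from the formula. The sketch below evaluates the logistic function with the parameter names <math>L</math>, <math>k</math> and <math>b</math> from the formula above; the specific values are arbitrary choices for illustration.

```python
import math

def logistic(x, L=1.0, k=1.0, b=0.0):
    """Logistic curve f(x) = L / (1 + e^(-k(x - b)))."""
    return L / (1.0 + math.exp(-k * (x - b)))

# Arbitrary example parameters: upper plateau 10, steepness 2, midpoint 3
low = logistic(-20.0, L=10.0, k=2.0, b=3.0)   # far left: close to the lower plateau, 0
mid = logistic(3.0, L=10.0, k=2.0, b=3.0)     # at x = b: exactly halfway, L/2
high = logistic(30.0, L=10.0, k=2.0, b=3.0)   # far right: close to the upper plateau, L
```

Here <math>L</math> sets the height of the upper plateau, <math>b</math> the position of the transition, and <math>k</math> how abrupt the transition is.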
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
===Confidence Interval=== | ===Confidence Interval=== | ||
− | Not a type of curve fitting, but a method of depicting the predictive power of a curve. | + | Not a type of curve fitting, but a method of depicting the predictive power of a curve. <p>Drawing a confidence interval over the graph shows the uncertainty in the acquired data, thus acknowledging the uncertain results of the experiment and showing a willingness not to "cheat" with "easy" regression curves.</p> |
− | |||
− | Providing a confidence interval over the graph shows the uncertainty of the acquired data, thus acknowledging the uncertain results of the experiment, and showing the will not to "cheat" with "easy" regression curves. | ||
− | |||
− | |||
− | |||
===Piecewise=== | ===Piecewise=== | ||
Mapping different curves to different segments of the data. This is a legitimate strategy, but the different segments should be meaningful, such as if they were pulled from different populations. | Mapping different curves to different segments of the data. This is a legitimate strategy, but the different segments should be meaningful, such as if they were pulled from different populations. | ||
− | This kind of fit would arise naturally in a study based on a regression discontinuity design. For instance, if students who score below a certain cutoff must take remedial classes, the line for outcomes of those below the cutoff would reasonably be separate from the one for outcomes above the cutoff; the distance between the end of the two lines could be considered the effect of the treatment, under certain assumptions. This kind of study design is used to investigate causal theories, where mere correlation in observational data is not enough to prove anything. Thus, the associated text would be appropriate; | + | This kind of fit would arise naturally in a study based on a regression discontinuity design. For instance, if students who score below a certain cutoff must take remedial classes, the line for outcomes of those below the cutoff would reasonably be separate from the one for outcomes above the cutoff; the distance between the end of the two lines could be considered the effect of the treatment, under certain assumptions. This kind of study design is used to investigate causal theories, where mere correlation in observational data is not enough to prove anything. Thus, the associated text would be appropriate; there is a theory, and data that might prove the theory is hard to find. |
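The regression discontinuity idea can be sketched in a few lines: fit one straight line on each side of the cutoff and measure the gap between them at the cutoff. The function names, the cutoff value and the data below are all hypothetical, and a real study would need far more care (bandwidth choice, standard errors, checks of the underlying assumptions).

```python
def fit_line(points):
    """Least-squares line through (x, y) pairs; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def jump_at_cutoff(points, cutoff):
    """Gap between the two fitted lines, evaluated at the cutoff."""
    below = [p for p in points if p[0] < cutoff]
    above = [p for p in points if p[0] >= cutoff]
    m_lo, b_lo = fit_line(below)
    m_hi, b_hi = fit_line(above)
    return (m_hi * cutoff + b_hi) - (m_lo * cutoff + b_lo)

# Hypothetical data: y = x below the cutoff, y = x + 3 at or above it
points = ([(float(x), float(x)) for x in range(0, 5)]
          + [(float(x), x + 3.0) for x in range(5, 10)])
jump = jump_at_cutoff(points, cutoff=5.0)
```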
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
===Connecting lines=== | ===Connecting lines=== | ||
− | + | Not useful whatsoever, but it looks nice! | |
− | + | It can result from overfitting the data set or from using curve-fitting tools incorrectly. |
− | |||
− | |||
===Ad-Hoc Filter=== | ===Ad-Hoc Filter=== | ||
− | Drawing a bunch of different lines by hand, keeping in only the data points perceived as "good". | + | Drawing a bunch of different lines by hand, keeping in only the data points perceived as "good". Also not useful. |
− | |||
− | |||
− | |||
===House of Cards=== | ===House of Cards=== | ||
− | Not a real method, but a common consequence of | + | Not a real method, but a common consequence of misapplying statistical methods: a curve can be generated that fits the data extremely well but becomes absurd as soon as one glances outside the range of the training data, at which point the analysis comes crashing down "like a house of cards". This is a type of ''overfitting''. In other words, the model may do quite well at (approximately) {{w|Interpolation|interpolating}} between values in the sample range, but does not extend at all well to {{w|Extrapolation|extrapolating}} values outside that range. |
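One way to reproduce this effect is to force a polynomial through every data point. The sketch below uses Lagrange interpolation, which is an assumption made for this example and not necessarily the kind of model the comic has in mind, on six made-up points that lie close to a straight line.

```python
def lagrange_eval(xs, ys, x):
    """Value at x of the unique polynomial passing through all (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Hypothetical, roughly linear data (close to y = x + 1)
xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 2.2, 2.9, 4.3, 4.8, 6.1]

inside = [lagrange_eval(xs, ys, x) for x in xs]  # reproduces every data point
outside = lagrange_eval(xs, ys, 8)               # extrapolated beyond the data
```

Within the sample range the curve passes through every point, but at <math>x = 8</math> its value is far above anything the roughly linear data would suggest.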
− | |||
− | |||
− | |||
− | |||
− | ===Cauchy-Lorentz | + | ===Cauchy-Lorentz=== |
{{w|Cauchy_distribution|Cauchy-Lorentz}} is a continuous probability distribution that has neither a defined expected value nor a defined variance. This means that the law of large numbers does not hold: the sample mean does not settle down as more data points are added, but keeps jumping around. This makes the distribution very troublesome to work with (mathematically alarming). | {{w|Cauchy_distribution|Cauchy-Lorentz}} is a continuous probability distribution that has neither a defined expected value nor a defined variance. This means that the law of large numbers does not hold: the sample mean does not settle down as more data points are added, but keeps jumping around. This makes the distribution very troublesome to work with (mathematically alarming). | ||
Since so many different models can fit this data set at first glance, Randall may be making a point about how if a data set is sufficiently messy, you can read any trend you want into it, and the trend that is chosen may say more about the researcher than about the data. This is a similar sentiment to [[1725: Linear Regression]], which also pokes fun at dubious trend lines on scatterplots. | Since so many different models can fit this data set at first glance, Randall may be making a point about how if a data set is sufficiently messy, you can read any trend you want into it, and the trend that is chosen may say more about the researcher than about the data. This is a similar sentiment to [[1725: Linear Regression]], which also pokes fun at dubious trend lines on scatterplots. | ||
− | |||
− | |||
− | |||
− | |||
==Transcript== | ==Transcript== | ||
+ | {{incomplete transcript|Do NOT delete this tag too soon.}} | ||
:'''Curve-Fitting Methods''' | :'''Curve-Fitting Methods''' | ||
:and the messages they send | :and the messages they send | ||
Line 132: | Line 69: | ||
:[The second plot shows a curve falling slightly down and then rising up to the right.] | :[The second plot shows a curve falling slightly down and then rising up to the right.] | ||
:Quadratic | :Quadratic | ||
− | :"I wanted a curved line, so I made one with | + | :"I wanted a curved line, so I made one with Math." |
:[In the third plot the curve starts near the bottom left and rises more and more slowly to the right.] | :[In the third plot the curve starts near the bottom left and rises more and more slowly to the right.] | ||
Line 138: | Line 75: | ||
:"Look, it's tapering off!" | :"Look, it's tapering off!" | ||
− | :[The fourth plot shows a curve starting near the left bottom and increases more and more steeper | + | :[The fourth plot shows a curve that starts near the bottom left and rises more and more steeply to the right.] |
:Exponential | :Exponential | ||
:"Look, it's growing uncontrollably!" | :"Look, it's growing uncontrollably!" | ||
Line 170: | Line 107: | ||
:"I had an idea for how to clean up the data. What do you think?" | :"I had an idea for how to clean up the data. What do you think?" | ||
− | :[The last plot shows a wave with increasing peak values | + | :[The last plot shows a wave with increasing peak values.] |
:House of Cards | :House of Cards | ||
:"As you can see, this model smoothly fits the- ''wait no no don't extend it AAAAAA!!''" | :"As you can see, this model smoothly fits the- ''wait no no don't extend it AAAAAA!!''" | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
{{comic discussion}} | {{comic discussion}} | ||
Line 185: | Line 115: | ||
[[Category:Comics with color]] | [[Category:Comics with color]] | ||
[[Category:Scatter plots]] | [[Category:Scatter plots]] | ||
− | |||
[[Category:Math]] | [[Category:Math]] | ||
[[Category:Science]] | [[Category:Science]] |