Difference between revisions of "Talk:2048: Curve-Fitting"

(19 intermediate revisions by 13 users not shown)

Revision as of 18:28, 2 September 2019


House of Cards: Not a real method, but a common consequence of misapplying statistical methods: a curve can be generated that fits the data extremely well but becomes absurd as soon as one glances outside the range of the training data, and your analysis comes crashing down "like a house of cards". This is a type of overfitting.

I'm pretty sure it refers to the TV show House of Cards, the dots representing the quality of the series increasing until Netflix renewed it a bit too much. 172.68.26.65 (talk) (please sign your comments with ~~~~)

This was my initial interpretation as well, since you can hypothetically extend a literal house of cards indefinitely. 172.68.58.83 14:23, 20 September 2018 (UTC)

Could someone familiar with the show expand on this? Also a potential reference to the TV show, House of Cards ("WAIT NO, NO, DON'T EXTEND IT!"). Some context on what that line meant in House of Cards would be helpful. - CRGreathouse (talk) 14:20, 21 September 2018 (UTC)

I'm a little mystified by the alt-text. Cauchy and Lorentz both seem like mathematically capable people. What am I missing? 172.69.62.226 17:46, 19 September 2018 (UTC)

Google-Fu reveals that it's a continuous probability distribution. This isn't bad per se, but it is quite visually distinctive and also can be quite...concerning if the data set isn't one where probability should be an issue. Werhdnt (talk) 18:00, 19 September 2018 (UTC)
This is not the issue; the issue is that the moments of the distribution (such as the mean and variance) don't exist, because the integrals that define them don't converge. See the edited explanation. So if you wanted to estimate the parameters of the distribution, the sample mean, for example, will not converge as the number of data points grows, and is therefore a bad thing to attempt. It is more mathematically alarming than alarmingly mathematical. GamesAndMath
My own Google-Fu brought me to a page with this information: “The distribution is important in physics as it is the solution to the differential equation describing forced resonance, while in spectroscopy it is the description of the line shape of spectral lines.” (from here: https://www.boost.org/doc/libs/1_53_0/libs/math/doc/sf_and_dist/html/math_toolkit/dist/dist_ref/dists/cauchy_dist.html) Justinjustin7 (talk) 18:09, 19 September 2018 (UTC)
True, but the "check what field I originally worked in" indicates that there might be something else going on with the meaning. 108.162.237.238 12:47, 20 September 2018 (UTC)
I believe the point of "check what field I originally worked in" is that if somebody who wasn't trained in statistics uses an exotic distribution, it is highly suspect and suggests that they are either torturing the data to get the desired results or have no idea what they are doing. 108.162.246.11 05:19, 21 September 2018 (UTC)
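To make the point about non-existent moments concrete, here is a minimal sketch (not taken from the comic or the explanation) that draws standard Cauchy samples and prints the running sample mean next to the running sample median. The seed, the sample sizes and the helper name cauchy_sample are arbitrary choices for illustration. The mean tends to keep jumping around as more points are added, while the median settles near the centre of the distribution.

```python
import math
import random
import statistics


def cauchy_sample(n, rng):
    """Draw n standard Cauchy variates via the inverse CDF tan(pi * (u - 0.5))."""
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


rng = random.Random(2048)  # arbitrary seed, used only so runs are repeatable
data = cauchy_sample(100_000, rng)

# Compare the two estimators on growing prefixes of the same sample.
for n in (100, 1_000, 10_000, 100_000):
    prefix = data[:n]
    print(n,
          "mean:", round(statistics.fmean(prefix), 3),
          "median:", round(statistics.median(prefix), 3))
```

The median is a reasonable estimator of the location parameter here; the sample mean is not, however many points are collected.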

To be honest, I'm a bit disappointed. I kinda expected a special comic with such a nice round number.. Been counting down since comic #2000... 162.158.92.184 18:14, 19 September 2018 (UTC)

Different anon here, I think this is very special and if Randall makes a poster available I will be buying several to give away. Of course, part of my business is experimental data analysis and modeling...and this is a fantastic summary of common errors. 162.158.75.22 (talk) (please sign your comments with ~~~~)

Agreed. This is a very special comic, and a highly subtle title text. Direct any of your friends who do data analysis here. Sort of the next stage from the classic "correlation is not causation" comic https://xkcd.com/552/ . -- GamesAndMath (talk) (please sign your comments with ~~~~)

Curve-Fitting

How fitting works needs to be explained. f(x)=mx+b works fine for single values, but how do we get that red line from the data set? --Dgbrt (talk) 20:12, 19 September 2018 (UTC)

Generally, you decide on some error function and then search for the parameters that minimize the sum of errors over all data points. -- Hkmaly (talk) 22:07, 19 September 2018 (UTC)
A typical error function is the square of the difference between the fit and the actual data point, hence "sum of squares" method. There are well-known standard formulas for finding m and b in the case of linear regression. In a linear algebra class, I saw a general method that would work for several of these (any where the fit is y = af(x)+bg(x)+...+ch(x), which includes log, exponential, quadratic, cubic, etc). I wish I could remember it. Blaisepascal (talk) 22:39, 19 September 2018 (UTC)
I'm still looking for an easy example. Let's say five points (x/y) and then calculating the straight line (without and maybe with the zero-point because this is often the assumed start). Just be simple, everything else derives from that. --Dgbrt (talk) 21:00, 20 September 2018 (UTC)
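I wish we could include the graphics at the top of https://en.wikipedia.org/wiki/Linear_regression#Introduction and https://en.wikipedia.org/wiki/Linear_regression#Interpretation in the explanation. A lot of people are going to look at this one. 172.68.133.168 17:51, 20 September 2018 (UTC)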
I've included one picture with a small explanation to the linear regression section. I think that explains it well. --Dgbrt (talk) 21:00, 20 September 2018 (UTC)
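As a possible answer to the request above for an easy example, here is a small sketch of an ordinary least-squares straight line through five points, once with an intercept and once forced through the zero point. The five data points are made up for illustration, and the function names fit_line and fit_line_through_origin are hypothetical helpers, not anything from the explanation page.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # m minimizes the sum of squared vertical distances to the line.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def fit_line_through_origin(xs, ys):
    """Least-squares fit of y = m*x (line forced through the zero point)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Five invented data points (x/y).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, b = fit_line(xs, ys)
m0 = fit_line_through_origin(xs, ys)
print(f"with intercept: y = {m:.2f}x + {b:.2f}")
print(f"through origin: y = {m0:.2f}x")
```

For these points the fit with an intercept comes out at roughly y = 0.6x + 2.2, and the fit through the origin at roughly y = 1.2x. The two lines differ noticeably, which shows how much the assumed model matters.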

The data points do not have error bars, which makes the choice of fit even more ludicrous, in my opinion. If the data are that good, then I don't believe there is a correlation; it's random with some distribution. I might hang this up at work... Arppix (talk) 02:46, 20 September 2018 (UTC)

And of course in serious science data points have error bars. This makes the fitting even more complicated and should be mentioned in the explanation. Because Randall doesn't use error bars, I'm sure he refers to presentations not based on real science. This should also be mentioned there. --Dgbrt (talk) 21:06, 20 September 2018 (UTC)
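One common way error bars enter a fit is weighted least squares, where each point is weighted by 1/sigma^2 so that uncertain points pull less on the line. The sketch below assumes independent errors in y only; the data, the sigma values and the function name fit_line_weighted are invented for illustration.

```python
def fit_line_weighted(xs, ys, sigmas):
    """Weighted least-squares fit of y = m*x + b with weights 1/sigma^2."""
    ws = [1.0 / s ** 2 for s in sigmas]
    s_w = sum(ws)
    s_x = sum(w * x for w, x in zip(ws, xs))
    s_y = sum(w * y for w, y in zip(ws, ys))
    s_xx = sum(w * x * x for w, x in zip(ws, xs))
    s_xy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    delta = s_w * s_xx - s_x ** 2
    m = (s_w * s_xy - s_x * s_y) / delta
    b = (s_xx * s_y - s_x * s_xy) / delta
    return m, b


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Equal error bars reproduce the ordinary (unweighted) fit.
m_equal, b_equal = fit_line_weighted(xs, ys, [1, 1, 1, 1, 1])

# A large error bar on the last point lets the other four dominate.
m_loose, b_loose = fit_line_weighted(xs, ys, [1, 1, 1, 1, 10])

print(m_equal, b_equal)
print(m_loose, b_loose)
```

With equal error bars the result matches the plain least-squares line; changing a single error bar changes the fitted line, which is part of why the absence of error bars leaves the choice of fit so open.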

I hate to be negative here, as obviously some users have put a lot of effort into explaining the details behind each of the curve-fitting methods, but there's absolutely no explanation for Randall's comments on each method. While someone might learn something about the various methods by reading the explanation, they would not gain any insight on what Randall is saying about each method. In addition, the Connecting Lines explanation totally missed the fact that this isn't really even a curve-fitting method - it's just a feature of graphing software (in this case, Excel) where a smooth line is drawn through each data point from left to right rather than an example of overfitting to the data set. I think we could do better. Ianrbibtitlht (talk) 02:53, 21 September 2018 (UTC)

You're not being negative; Randall's comments are missing, which I've just added to the incomplete reason. And sure, other explanations still need a review. --Dgbrt (talk) 20:32, 21 September 2018 (UTC)

Everyone is missing the deeper trolling here of the fisheries community at large, which shall become blindingly clear here. First, this is cartoon number 2048 (2^11), a highly interesting number. Notably, this is the year by which all fisheries were projected to have collapsed by Worm et al. (2006) Science 314:787-790, a prediction which gained huge attention in the media and took on a life of its own. The prediction was based on fitting a power curve to some data on collapses in catch trends. Numerous rebuttals followed, one of which pointed out that a linear fit to the data is a better fit, and predicts all fisheries collapsed in 2114 (Jaenike et al. 2007, Science 316:1285a). A list of rebuttals is found here: https://sites.google.com/a/uw.edu/most-cited-fisheries/controversies/2048-projection. Later work by the same author and critics found a different prediction and showed that rebuilding of fisheries is likely (Worm et al. 2009 Science 325:578-585). Second, lest you think this is a conspiracy theory, I note that in xkcd cartoon 887, "The future according to google search results", Munroe specifically notes this prediction: 2048: "Salt-water fish extinct from overfishing" (https://xkcd.com/887/). Third, this kind of model-fitting exercise has long plagued fisheries researchers attempting to predict recruitment from spawning biomass. 108.162.246.11 (talk) (please sign your comments with ~~~~)

"Ad hoc filter: Drawing a bunch of different lines by hand, keeping in only the data points perceived as "good". Also not useful. " – I guess it rather refers to data filtering, where for each point you take several neighbouring points and calculate some kind of average, e.g. by rejecting the most extreme points or by calculating the median (see https://en.wikipedia.org/wiki/Median_filter). So it is an algorithm, not actually drawing lines by hand. Still, it is tricky to draw conclusions, and you can easily fool yourself with this method. 162.158.93.21 (talk) (please sign your comments with ~~~~)
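For readers unfamiliar with the median filter mentioned above, here is a minimal sketch of one. The window size, the way the window is shortened at the edges and the sample data are all assumptions chosen for illustration; real implementations handle edges in various ways.

```python
import statistics


def median_filter(values, window=3):
    """Replace each value by the median of the window centred on it.

    At the edges the window is simply truncated (one possible convention).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(statistics.median(values[lo:hi]))
    return smoothed


# Invented series with two obvious outliers (30 and -20).
noisy = [1, 2, 30, 4, 5, 6, -20, 8, 9]
smoothed = median_filter(noisy, window=3)
print(smoothed)
```

The outliers disappear from the smoothed series, which is exactly why such a filter can be both useful and misleading: the choice of window decides which points count as "good".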

Anyways, what is the actual regression of the plot? 162.158.154.241 (talk) (please sign your comments with ~~~~)

This also must be better explained: We don't know what the points represent. The fraction of apples vs. bananas harvested by time, the position of stars in the sky, on a logarithmic scale, linear, or maybe the height of mountains in New Jersey... There are just some dots on paper with no further meaning. Thus everything Randall presents is valid by some means but an actual regression does not exist. --Dgbrt (talk) 20:32, 21 September 2018 (UTC)

Just want to note that piecewise models are actually a type of modelling often used in housing economics. They have been used to check whether different types of housing are priced according to different rules. 172.68.34.34 22:05, 21 September 2018 (UTC)

Excel's "smooth lines" are actually splines (third-order Bezier splines, apparently; see https://blog.splitwise.com/2012/01/31/mystery-solved-the-secret-of-excel-curved-line-interpolation/), so they're not completely without mathematical merit. Still wildly unsuited for extrapolation, but often very well suited to interpolation. JohnWhoIsNotABot (talk) 21:44, 24 September 2018 (UTC)

Specific functions

In both the logarithmic and exponential functions, I have deleted the term "+ c" that was present in both. Simply put, these functions do not include an additive constant. Including the constant removes a basic property of e.g. exponential functions, which is that the function should grow by the same factor for equal increases in the value of x. (In other words, if the function doubles when x changes from 1 to 2, then it should double again when x changes from 2 to 3, or from 3 to 4, etc.) If this does not happen, the function is not exponential. Redbelly98 (talk) 19:52, 13 October 2018 (UTC)
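The constant-factor property described above can be checked numerically. The sketch below compares a pure exponential with the same function plus an additive constant; the coefficients 3, 2 and 10 and the helper name step_ratios are arbitrary choices for illustration.

```python
def step_ratios(f, xs):
    """Ratio f(next x) / f(x) for each pair of consecutive x values."""
    return [f(b) / f(a) for a, b in zip(xs, xs[1:])]


def pure_exponential(x):
    return 3 * 2 ** x


def shifted_exponential(x):
    # Same function with an additive constant "+ c" (here c = 10).
    return 3 * 2 ** x + 10


xs = [1, 2, 3, 4]
pure_ratios = step_ratios(pure_exponential, xs)
shifted_ratios = step_ratios(shifted_exponential, xs)

print(pure_ratios)     # the same factor at every step
print(shifted_ratios)  # a different factor at every step
```

The pure exponential doubles at every unit step, while the shifted version grows by a different factor each time, so it no longer has the property described in the comment.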


Logistic Curve

The explanation for logistic curve currently says it is used for binary values. It's actually a lot more useful than that. For example, population growth is often described as a logistic curve. It appears to be climbing exponentially initially, but then tapers off as resources can no longer support the population. 108.162.246.191 15:31, 8 November 2018 (UTC)

The explanation mentions the logistic regression ranging between "0" and "1". It uses the more general logistic function you probably refer to. In its basic form, logistic regression uses a logistic function to model a binary dependent variable. Both Wikipedia links explain the difference. Honestly, I'm not an expert on the matter, but that binary interpretation wouldn't allow values above "1" or below "0" as shown in the picture. That may be worth mentioning. Nonetheless, all the other fittings are similar nonsense. Maybe we could mention the more general Sigmoid function, but this only barely fits the title "Logistic Curve". --Dgbrt (talk) 23:09, 8 November 2018 (UTC)
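To illustrate the difference discussed above, here is a small sketch of the general logistic function with an upper limit L, a growth rate k and a midpoint x0. With L = 1 it is the curve used in logistic regression; with a larger L it can describe something like a population levelling off. The parameter values in the population example are invented.

```python
import math


def logistic(x, L=1.0, k=1.0, x0=0.0):
    """General logistic function: L / (1 + exp(-k * (x - x0)))."""
    return L / (1.0 + math.exp(-k * (x - x0)))


# Standard form (L = 1): values always lie strictly between 0 and 1.
print(logistic(0.0))

# Invented population example: carrying capacity 1000, midpoint at t = 10.
population = [logistic(t, L=1000.0, k=0.5, x0=10.0) for t in range(0, 21, 5)]
print([round(p, 1) for p in population])
```

Early on the population values grow roughly exponentially, then the growth tapers off as the curve approaches the limit L, which matches the description of resources no longer supporting the population.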

Personally, I think the exponential fit seems like the most reasonable interpretation of the data.