Talk:2494: Flawed Data

  
 
On first reading, I was thinking the Good approach would be to go out and run new experiments and measurements, incorporating the lessons from their flawed data, to avoid making the same mistakes again.  This can be quite expensive, but it is really the only way to increase the validity of the data.  Just saying "We can't trust our conclusions," throws away the opportunity to learn from earlier mistakes and come up with better measurements next time.  [[User:Nutster|Nutster]] ([[User talk:Nutster|talk]]) 14:38, 28 July 2021 (UTC)
 
 
Is this a reference to Biogen doing some motivated post hoc subgroup analysis to get Aduhelm approved?
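To illustrate the general problem with post hoc subgroup analysis (this is a toy sketch on pure noise, with no connection to the actual Aduhelm trial data; it assumes NumPy and SciPy, and every number in it is made up): if a null result is split into enough arbitrary subgroups, some of them will often look "significant" by chance alone.

<pre>
# Toy demonstration: post hoc subgroup analysis on pure noise.
# With 20 arbitrary subgroups tested at p < 0.05, spurious "effects" are expected.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
treatment = rng.normal(size=200)            # no true effect in either arm
control = rng.normal(size=200)
g_treat = rng.integers(0, 20, size=200)     # 20 arbitrary subgroup labels
g_ctrl = rng.integers(0, 20, size=200)

for g in range(20):
    _, p = ttest_ind(treatment[g_treat == g], control[g_ctrl == g])
    if p < 0.05:
        print(f"subgroup {g}: p = {p:.3f}  (spurious; the data is pure noise)")
</pre>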
 
 
From what I know, the "very bad" approach is becoming common in data science; see the Wikipedia page for {{w|Synthetic_data#Synthetic_data_in_machine_learning}} or, when done one feature at a time, {{w|Imputation_(statistics)}}. Imputation can be problematic when the data is missing because of some confounding variable, so filling in values based on the existing ones will bias the results. A related example is class imbalance, where some groups are underrepresented and therefore won't be predicted as accurately as overrepresented groups. Instead of gathering more data, especially more representative data, data scientists will often use something like SMOTE to generate synthetic samples.  An example of a widely used but frankly bad synthetic dataset is kddcup99.  [[Special:Contributions/172.68.142.147|172.68.142.147]] 05:25, 11 August 2021 (UTC)
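As a minimal sketch of the two techniques mentioned above, assuming scikit-learn and imbalanced-learn are installed (the toy numbers below are invented, not real data):

<pre>
# Minimal sketch: mean imputation of a missing value, then SMOTE oversampling
# of the minority class. Toy data only.
import numpy as np
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

# 12 rows, 2 features; one missing value and a 9-vs-3 class imbalance.
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 1.5], [4.0, 2.2],
              [5.0, 1.8], [6.0, 2.1], [7.0, 1.9], [8.0, 2.4],
              [9.0, 2.0], [1.5, 9.0], [2.5, 8.5], [3.5, 9.2]])
y = np.array([0] * 9 + [1] * 3)

# Imputation fills the NaN from the column mean. If values are missing because
# of a confounder, this quietly biases the feature distribution.
X_imp = SimpleImputer(strategy="mean").fit_transform(X)

# SMOTE interpolates new minority-class rows between existing neighbours
# instead of collecting more (or more representative) real data.
X_res, y_res = SMOTE(k_neighbors=2, random_state=0).fit_resample(X_imp, y)
print(X.shape, "->", X_res.shape)   # (12, 2) -> (18, 2): 6 synthetic rows added
</pre>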
 
 
Super Bad: So we used an RNG to make completely random data
 
 
Ultra Bad: So I just picked my favorite numbers to use as data
 
 
The joke here is that adding the artificial intelligence variable is going to get things really screwed up!
 
