Editing Talk:2494: Flawed Data
From what I know, the "very bad" approach is becoming common in data science; see the Wikipedia page for {{w|Synthetic_data#Synthetic_data_in_machine_learning}} or, when done on a single feature at a time, {{w|Imputation_(statistics)}}. Imputation can be problematic when the data is missing due to some confounding variable, because filling in the gaps based on the existing values will then bias the results. A slightly related example is class imbalance, where some groups are underrepresented and therefore won't be predicted as accurately as overrepresented groups. Instead of gathering more data, especially more representative data, data scientists will often use something like SMOTE to generate more data. An example of a widely used but frankly bad synthetic dataset is kddcup99. [[Special:Contributions/172.68.142.147|172.68.142.147]] 05:25, 11 August 2021 (UTC)
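To make the imputation point concrete, here is a minimal sketch in Python. The income figures, the "high earners don't report" rule, and the helper name <code>mean_impute</code> are all made up for illustration; this is not a real dataset or any particular library's API. It only shows how filling gaps with the mean of the observed values can drag an estimate away from the truth when the missingness depends on the value itself.

```python
from statistics import mean


def mean_impute(values):
    """Replace None entries with the mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical incomes in thousands (invented numbers, for illustration only).
true_incomes = [20, 25, 30, 35, 40, 45, 50, 120, 150, 200]

# Assumed missingness rule: anyone earning over 100 declines to report.
# The gap is therefore caused by the very quantity we want to estimate.
reported = [v if v <= 100 else None for v in true_incomes]

imputed = mean_impute(reported)

true_mean = mean(true_incomes)
imputed_mean = mean(imputed)

print("true mean:   ", true_mean)
print("imputed mean:", imputed_mean)
```

Under these assumptions the imputed mean comes out well below the true mean, because every filled-in value is drawn from the low earners who did report. Oversampling methods like SMOTE have a similar limitation: they interpolate between points you already have, so they cannot add information about cases that never made it into the sample.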