Difference between revisions of "2731: K-Means Clustering"

Explain xkcd: It's 'cause you're dumb.
(Transcript: Complete and adding Public speaking and research categories. Removing math as this is purely statistics and that category is already part of the math category)
m (Transcript: typo)

Revision as of 09:26, 31 January 2023

K-Means Clustering
Title text: According to my especially unsupervised K-means clustering algorithm, there are currently about 8 billion types of people in the world.

Explanation

This explanation may be incomplete or incorrect: Created by 3 TYPES OF EDITORS - Please change this comment when editing this page. Do NOT delete this tag too soon.
If you can address this issue, please edit the page! Thanks.

A popular class of wry observations uses the snowclone "There are two types of people in the world... those that do A, and those that do [B, usually, though not always, some variant of A]". The most self-referential version is the joke "There are two types of people in the world - those that divide people into two types, and those that don't". Another well-known joke is "There are two types of people in the world - those that can interpolate..."

k-means clustering is a method of partitioning data into groups. To explain how it works, imagine we have a list of people of various heights and weights, and we wish to split the list into 3 groups. One way to do this would be to first plot the data on a scatter chart. We then pick three points at random as reference points and sort the people according to which reference point they are closest to, forming 3 initial groups. Next, we compute the average (mean) of the data points in each group and use those averages as new reference points to once again sort all the data into 3 new groups. This process is repeated until it converges; that is, until no data point changes groups after new reference points are computed.
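The procedure described above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the height and weight values are made up, the function name `kmeans` and its parameters are hypothetical, and the starting reference points are assumed to be drawn from the data itself (one common choice among several).

```python
import random


def sq_dist(a, b):
    """Squared straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Split points into k groups by repeated assign-and-average steps."""
    rng = random.Random(seed)
    # Pick k reference points at random. Assumption: they are taken from
    # the data itself rather than from anywhere on the chart.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Sort every point into the group of its closest reference point.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Converged: no point changed groups since the previous pass.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each reference point to the average of its group.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty group keeps its old reference point
                centroids[j] = tuple(
                    sum(c) / len(members) for c in zip(*members)
                )
    return labels, centroids


# Hypothetical (height in cm, weight in kg) pairs, made up for illustration.
people = [(150, 50), (152, 53), (155, 52), (170, 68), (172, 70),
          (175, 72), (190, 95), (192, 98), (195, 100)]
labels, centroids = kmeans(people, k=3)
print(labels)
print(centroids)
```

Because the starting points are random, different seeds can produce different groupings, and the result is not guaranteed to be the most natural-looking split.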

The K-means algorithm is quite simple, which contributes to its popularity, but it has a major drawback: the analyst has to decide in advance how many groups (or clusters) to split the data into (that is, what to set K equal to). A value of K that doesn't match the underlying structure of the data can yield a partitioning that's hard to explain in terms of properties that distinguish each cluster (in other words, their qualitative interpretation is unclear).
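As a rough illustration of this drawback, the sketch below runs the same procedure on made-up data that contains two obvious groups, once with K=2 and once with K=3. The data, the function name `kmeans`, and the choice to start from the first K points (made here only so the run is reproducible) are all assumptions for the example.

```python
def sq_dist(a, b):
    """Squared straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Return a group label for each point, using k groups."""
    # Assumption: start from the first k points so the result is
    # reproducible; real implementations usually start at random.
    centroids = list(points[:k])
    labels = None
    for _ in range(max_iter):
        # Assign each point to its closest reference point.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # nothing moved: converged
            break
        labels = new_labels
        # Recompute each reference point as the average of its group.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(
                    sum(c) / len(members) for c in zip(*members)
                )
    return labels


# Hypothetical (height in cm, weight in kg) data with two clear groups.
people = [(150, 50), (152, 52), (154, 54), (190, 90), (192, 92), (194, 94)]

print(kmeans(people, 2))  # [0, 0, 0, 1, 1, 1]
print(kmeans(people, 3))  # [0, 1, 1, 2, 2, 2]
```

With K=2 the groups match the visible structure. With K=3 the algorithm still returns three groups without complaint, but the split inside the shorter group has no obvious meaning.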

Ponytail's finding that there are three clusters is unsurprising if she herself belongs to the group that uses K=3 as a fixed value, since that choice inevitably produces three clusters. The joke is that one group's defining trait is "uses K=3", which logically means that everyone outside that group does not use K=3. But there are two other groups, and that description applies to both of them, so it is unclear what distinguishes those two groups from each other.

The title text refers to a K-means algorithm with an absurdly exaggerated variant of this problem. If the number of clusters is equal to the number of data points, each point will be assigned to a separate cluster; in other words, each person is the sole member of their own group. With such parameters, it is impossible to comment meaningfully on similarities between any two members. This is humorous because it makes the result useless for the purposes for which clustering algorithms are typically used, such as building insurance risk pools or targeting advertising campaigns.
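The degenerate case can be checked with a tiny example. The sketch below assumes five made-up, distinct data points standing in for the 8 billion, and a hypothetical helper `one_cluster_per_point` that sets K equal to the number of points and uses every point as its own reference point.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def one_cluster_per_point(points):
    """Run one assignment step with K equal to the number of points."""
    k = len(points)
    # Every point starts as its own reference point.
    centroids = list(points)
    labels = [
        min(range(k), key=lambda j: sq_dist(p, centroids[j]))
        for p in points
    ]
    # Total squared distance from each point to its reference point.
    spread = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, spread


# Hypothetical (height in cm, weight in kg) pairs, all distinct.
people = [(150, 50), (165, 60), (172, 70), (180, 85), (195, 100)]
labels, spread = one_cluster_per_point(people)
sizes = Counter(labels)

print(labels)                # [0, 1, 2, 3, 4]
print(max(sizes.values()))   # 1
print(spread)                # 0
```

Every group has exactly one member and the fit is perfect, which is exactly why the result says nothing about how any two people resemble each other.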

Interestingly, by including the entire human population, the algorithm should be immune to bias in creating its input data. However, since every human is unique,[citation needed] the only way to have the clusters converge is to "throw out" some traits of humans as unimportant. This may be objectionable to humans who disagree with that assessment. In contrast, in a supervised algorithm, the training data is tagged with traits that the trainers seek. These traits could be applied in a manner that is socially unacceptable, and lead to AI behaviour that reflects the biases of the trainers.

Transcript

[Ponytail is standing on a podium pointing a stick towards a poster hanging behind her. The writings and figures on the poster are illegible. But there seems to be a large scatter plot at the top with a heading above it. Also a couple of tables beneath this. She addresses an unseen audience in front of the podium.]
Ponytail: Our analysis shows that there are three kinds of people in the world:
Ponytail: Those who use k-means clustering with k=3, and two other types whose qualitative interpretation is unclear.



Discussion

The wikipedia article does not clear anything up 162.158.78.228 13:53, 30 January 2023 (UTC)Bumpf

Yeah. A while back I read a wikipedia article and was determined, for once, to completely understand it. Four years later, I had a PhD in an obscure (and totally useless) element of esoteric math. BTW, it turns out the article was completely wrong! /S 172.70.114.7 12:54, 2 February 2023 (UTC)Bumpf aussi

The "Convergence of k-means" animation is reasonably distinctive for a two-dimensional case, showing at least the motivation for the problem. Could it be attached here? Mia yun Ruse (talk) 14:08, 30 January 2023 (UTC)

Yeah, this is probably the least explanatory Explain xkcd I've read in the past 3 years. Still a lot of heavy math. 162.158.186.95 16:50, 30 January 2023 (UTC)

This feels very similar to the joke "There are 10 types of people: those who know binary and those who don't." Except that the real joke here is that Ponytail doesn't have anything meaningful to justify her version. 172.70.206.150 17:45, 30 January 2023 (UTC)

Current explanation claims that since every human is unique, clusters can only be formed by ignoring some traits. This seems false; a cluster could depend on multiple traits, so there's no obvious limit to the number of traits that could be used when forming clusters. Perhaps they mean that clusters can only be formed by combining non-identical points into the same cluster, but that's literally the entire purpose of clustering and applies to all clustering ever, so it seems like both a trivial observation and a non-sequitur. Am I missing something? 172.70.211.90 19:54, 30 January 2023 (UTC)

Yes, the joke about why there are 8 billion clusters mentioned in the title text. 162.158.78.220 20:47, 30 January 2023 (UTC)
No, I did not miss that. --172.70.211.136 22:53, 30 January 2023 (UTC)
While it's true that clusters can depend on multiple traits, a cluster that depends on ALL human traits at once (or a very large number of them) is useless in practice. A useful cluster depends on a relatively limited number of traits. I think that's where the "ignoring" comes in. 162.158.146.208 22:30, 30 January 2023 (UTC)
Supposing that's true, that would apply to any sample of humans. The "since all humans are unique" part would still be false, and the comment still wouldn't make sense in context as a response to the specific scenario of 8 billion humans. --172.70.211.136 22:53, 30 January 2023 (UTC)
Most people would object to the idea that they are fully defined by their DNA. Yet even taking just DNA, the probability of two humans having the same DNA is practically zero. Even identical twins have differences in DNA due to radiation and toxins! Sure, 99% of DNA is identical between all humans (it is what makes them human), but DNA is over 6 gigabase pairs. And how many do you think criminalists need in DNA identification to ensure match probabilities of 1 in a quintillion? Just hundreds. Yes, every human is unique. -- Hkmaly (talk) 02:50, 31 January 2023 (UTC)
Obviously humans are unique, and I never suggested otherwise. The thing that's false is the complete statement "it's necessary to ignore some traits BECAUSE all humans are unique". I actually think "it's necessary to ignore some traits" is not well-supported even if you stop there, but even if that part is true, it's definitely not a RESULT of all humans being unique. The current explanation reads like someone is twisting the topic to squeeze in a comment about their hobby horse even though it's not actually relevant. --162.158.90.38 00:37, 1 February 2023 (UTC)
It's just wrong to say you have to ignore some traits. I'm a data scientist and I've actually used k-means clustering at my job... everyone *is* unique so, you do lose information when you bucket them, but it isn't because you're throwing out some traits. You're just defining groups based on those traits. If I've got 20 people of all different heights, grouping them into "tall" and "short" is not throwing out height as a trait. The explanation is simply wrong. 172.70.38.77 13:48, 2 February 2023 (UTC)

Many people object to being defined by some group they belong to. E.g. people object to blanket statements about members of political parties ("I'm a Republican, but I'm pro-choice"), religions, age groups (the adage "If You Are Not a Liberal at 25, You Have No Heart. If You Are Not a Conservative at 35 You Have No Brain"), etc. I think this is the idea that the title text is going for. Barmar (talk) 20:43, 31 January 2023 (UTC)

There are two types of people in the world: those who use the word “who” to refer to people and the word “that” to refer to things, and those who don’t. 172.71.151.77 02:58, 31 January 2023 (UTC)

...and those whom use "whom"..? 172.70.162.57 09:00, 31 January 2023 (UTC)
Sure, there are plenty who misuse “whom” also. “Who / he / she / they VERB” vs “PREPOSITION whom / him / her / them” - who did, who has, who owns, he did, she has, they own - for whom, by whom, about whom, for him, by her, about them. A person who, a thing that. It’s really not that complicated. 172.71.147.21 10:48, 1 February 2023 (UTC)

Methodological note: k-means is a special case of parametric model-based clustering (here spheres with equal variance), which allows fitting cluster models with different numbers of clusters and choosing the 'best' one according to the BIC (Bayesian Information Criterion), see https://cran.r-project.org/package=mclust. A broader non-parametric class of cluster solutions can be fitted with the truecluster meta-algorithm, choosing the one with the best CIC (Cluster Information Criterion), see https://arxiv.org/abs/cs/0601001 and https://arxiv.org/abs/0705.4302. Joehl (talk) 16:55, 2 February 2023 (UTC)

editorial

This sentence clause appears to contain a typo: ", and indicate on a graph of the data has two distinct populations". It might be clearer as ", and indicate on a graph if the data has two distinct populations". (Fixed)