Talk:2731: K-Means Clustering

Revision as of 16:55, 2 February 2023 by Joehl (talk | contribs) (Added a methodological note on generalizations of k-means that allow objective determination of the optimal number of clusters given an assumed class of cluster definitions)

The Wikipedia article does not clear anything up. 13:53, 30 January 2023 (UTC) Bumpf

Yeah. A while back I read a Wikipedia article and was determined, for once, to completely understand it. Four years later, I had a PhD in an obscure (and totally useless) element of esoteric math. BTW, it turns out the article was completely wrong! /S 12:54, 2 February 2023 (UTC) Bumpf aussi

The "Convergence of k-means" animation is reasonably distinctive for a two-dimensional case, showing at least the motivation for the problem. Could it be attached here? Mia yun Ruse (talk) 14:08, 30 January 2023 (UTC)

Yeah, this is probably the least explanatory Explain xkcd I've read in the past 3 years. Still a lot of heavy math. 16:50, 30 January 2023 (UTC)

This feels very similar to the joke "There are 10 types of people: those who know binary and those who don't." Except that the real joke here is that Ponytail doesn't have anything meaningful to justify her version. 17:45, 30 January 2023 (UTC)

Current explanation claims that since every human is unique, clusters can only be formed by ignoring some traits. This seems false; a cluster could depend on multiple traits, so there's no obvious limit to the number of traits that could be used when forming clusters. Perhaps they mean that clusters can only be formed by combining non-identical points into the same cluster, but that's literally the entire purpose of clustering and applies to all clustering ever, so it seems like both a trivial observation and a non-sequitur. Am I missing something? 19:54, 30 January 2023 (UTC)

Yes, the joke about why there are 8 billion clusters mentioned in the title text. 20:47, 30 January 2023 (UTC)
No, I did not miss that. -- 22:53, 30 January 2023 (UTC)
While it's true that clusters can depend on multiple traits, a cluster that depends on ALL human traits at once (or a very large number of them) is useless in practice. A useful cluster depends on a relatively limited number of traits. I think that's where the "ignoring" comes in. 22:30, 30 January 2023 (UTC)
Supposing that's true, that would apply to any sample of humans. The "since all humans are unique" part would still be false, and the comment still wouldn't make sense in context as a response to the specific scenario of 8 billion humans. -- 22:53, 30 January 2023 (UTC)
Most people would object to the idea that they are fully defined by their DNA. Yet even taking just DNA, the probability of two humans having the same DNA is practically zero. Even identical twins have differences in DNA due to radiation and toxins! Sure, 99% of DNA is identical between all humans (it is what makes them human), but DNA is over 6 gigabase pairs. And how many do you think criminalists need in DNA identification to ensure match probabilities of 1 in a quintillion? Just hundreds. Yes, every human is unique. -- Hkmaly (talk) 02:50, 31 January 2023 (UTC)
Obviously humans are unique, and I never suggested otherwise. The thing that's false is the complete statement "it's necessary to ignore some traits BECAUSE all humans are unique". I actually think "it's necessary to ignore some traits" is not well-supported even if you stop there, but even if that part is true, it's definitely not a RESULT of all humans being unique. The current explanation reads like someone is twisting the topic to squeeze in a comment about their hobby horse even though it's not actually relevant. -- 00:37, 1 February 2023 (UTC)
It's just wrong to say you have to ignore some traits. I'm a data scientist and I've actually used k-means clustering at my job... everyone *is* unique, so you do lose information when you bucket them, but it isn't because you're throwing out some traits. You're just defining groups based on those traits. If I've got 20 people of all different heights, grouping them into "tall" and "short" is not throwing out height as a trait. The explanation is simply wrong. 13:48, 2 February 2023 (UTC)
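To make the tall/short point above concrete, here is a minimal sketch in plain Python. The 20 heights (in cm) are made up for illustration, and `kmeans_1d` is a hypothetical helper written for this comment, not a library function. It runs the standard assign-then-average loop with k = 2 and a simple deterministic starting guess (an assumption; real implementations usually randomize the start).

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on a list of numbers; returns (centroids, labels)."""
    ordered = sorted(values)
    n = len(ordered)
    # ASSUMPTION: deterministic start, spreading initial centroids over the sorted data.
    centroids = [ordered[n * (2 * i + 1) // (2 * k)] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        if new_labels == labels:
            break  # no assignment changed, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = sum(members) / len(members)
    return centroids, labels


# Hypothetical heights in cm, all different, as in the comment above.
heights = [152, 155, 158, 160, 161, 163, 164, 165, 167, 168,
           172, 175, 177, 178, 180, 182, 183, 185, 188, 191]

centroids, labels = kmeans_1d(heights, k=2)
tall_cluster = centroids.index(max(centroids))
tall = [h for h, lab in zip(heights, labels) if lab == tall_cluster]
short = [h for h, lab in zip(heights, labels) if lab != tall_cluster]

print("short:", short)
print("tall:", tall)
```

The group a person lands in is decided entirely by their height, so the trait is used rather than ignored. What is lost is the difference between people inside the same group.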

Many people object to being defined by some group they belong to. E.g. people object to blanket statements about members of political parties ("I'm a Republican, but I'm pro-choice"), religions, age groups (the adage "If You Are Not a Liberal at 25, You Have No Heart. If You Are Not a Conservative at 35 You Have No Brain"), etc. I think this is the idea that the title text is going for. Barmar (talk) 20:43, 31 January 2023 (UTC)

There are two types of people in the world: those who use the word “who” to refer to people and the word “that” to refer to things, and those who don’t. 02:58, 31 January 2023 (UTC)

...and those whom use "whom"..? 09:00, 31 January 2023 (UTC)
Sure, there are plenty who misuse “whom” also. “Who / he / she / they VERB” vs “PREPOSITION whom / him / her / them” - who did, who has, who owns, he did, she has, they own - for whom, by whom, about whom, for him, by her, about them. A person who, a thing that. It’s really not that complicated. 10:48, 1 February 2023 (UTC)

Methodological note: k-means is a special case of parametric model-based clustering (here, spheres with equal variance), which allows one to fit cluster models with different numbers of clusters and choose the 'best' one according to the BIC (Bayesian Information Criterion), see A broader non-parametric class of cluster solutions can be fitted with the truecluster meta-algorithm, and the solution with the best CIC (Cluster Information Criterion) can then be chosen, see and Joehl (talk) 16:55, 2 February 2023 (UTC)
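A rough sketch of the BIC idea in the note above, in plain Python. This is not the truecluster algorithm and not a full model-based fit: as a simplifying assumption it scores each k-means result with the hard-assignment likelihood of equal-variance Gaussian clusters, then picks the k with the lowest BIC. The data and the helper names (`kmeans_1d`, `bic_for_k`) are made up for illustration.

```python
import math


def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on a list of numbers; returns (centroids, labels)."""
    ordered = sorted(values)
    n = len(ordered)
    # ASSUMPTION: deterministic start, spreading initial centroids over the sorted data.
    centroids = [ordered[n * (2 * i + 1) // (2 * k)] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, labels


def bic_for_k(values, k):
    """BIC of a k-cluster fit, treating clusters as equal-variance Gaussians.

    ASSUMPTION: uses the hard-assignment (classification) likelihood of the
    k-means result rather than a full EM mixture fit.
    """
    n = len(values)
    centroids, labels = kmeans_1d(values, k)
    sse = sum((v - centroids[lab]) ** 2 for v, lab in zip(values, labels))
    variance = max(sse / n, 1e-12)  # floor avoids log(0) on degenerate fits
    sizes = [labels.count(j) for j in range(k)]
    # Mixing-weight term plus the Gaussian term with a shared variance.
    log_lik = sum(s * math.log(s / n) for s in sizes if s > 0)
    log_lik -= 0.5 * n * math.log(2 * math.pi * variance) + 0.5 * sse / variance
    n_params = k + (k - 1) + 1  # k means, k-1 free weights, 1 shared variance
    return -2 * log_lik + n_params * math.log(n)


# Made-up 1-D data with two well-separated groups.
data = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]
scores = {k: bic_for_k(data, k) for k in range(1, 5)}
best_k = min(scores, key=scores.get)
print(scores)
print("lowest BIC at k =", best_k)
```

Adding clusters always reduces the within-cluster error, so the error alone would keep rewarding larger k. The `n_params * log(n)` penalty is what lets the criterion stop at a smaller number of clusters.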


This sentence clause appears to contain a typo: ", and indicate on a graph of the data has two distinct populations". It might be clearer as ", and indicate on a graph if the data has two distinct populations". (Fixed)