Title text: I didn't even realize you could HAVE a data set made up entirely of outliers.
The comic shows Cueball presenting data that was probably gathered in research. It's not clear what type of data it is, but one spike has been highlighted on the graph, even though it appears no larger than the noise in the data and is much smaller than the central peak. Cueball seems to have made some kind of mistake in either the statistics or the measurement of the unspecified subject of his research, so his data contains many outliers. The word "artifact" is a play on two meanings. It is either an archaeological artifact (such as the Holy Grail in Indiana Jones and the Last Crusade) or a fault in an experiment, where the experimenter (usually accidentally) influences the measurement through the equipment or through unanticipated environmental factors. The latter are called error artifacts.
Indiana Jones is (often humorously) cited as being a bad archaeologist. He often destroys the area in which he is looking for artifacts, even though the context in which they are found is as important, archaeologically, as the artifacts themselves, if not more so. He does not appear to make any records, carries the artifacts around without any thought for their ancient and fragile nature, and most often ends up losing the artifacts altogether.
An example of an error artifact occurs when measuring the force between two charged metal spheres (the Coulomb force): the potential of nearby ungrounded objects influences the measurement. Artifacts have been mentioned before in xkcd, as in 1453: fMRI, where getting into the MRI machine induced unintended effects, such as thoughts of claustrophobia.
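The title text refers to the entire data set being "outliers." In statistics, an outlier is an observation that lies far from the other observations. One proposed way to build a data set composed entirely of outliers is to take N points in an N/2-dimensional space, where each point is zero in every dimension except one, in which it is either +1 or -1, so that each axis carries exactly two points. Every point then lies at the same distance from the mean (the origin), and at distance √2 from every other point except its opposite, which is at distance 2. Whether such evenly spread points really count as outliers is debatable, since no point is farther from the rest than any other.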
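The ±1 construction described for the title text can be checked numerically. The following is only a minimal sketch in plain Python; the helper names `signed_unit_points` and `centroid` are made up for this illustration and do not come from the comic or any library. It builds N = 6 points in 3 dimensions and lists the distinct distances that occur.

```python
import math
from itertools import combinations


def signed_unit_points(half_n):
    """Hypothetical helper: return 2 * half_n points in half_n dimensions,
    namely +1 and -1 on each axis and 0 everywhere else."""
    points = []
    for axis in range(half_n):
        for sign in (1.0, -1.0):
            point = [0.0] * half_n
            point[axis] = sign
            points.append(tuple(point))
    return points


def centroid(points):
    """Hypothetical helper: coordinate-wise mean of the points."""
    count = len(points)
    return tuple(sum(coords) / count for coords in zip(*points))


points = signed_unit_points(3)  # N = 6 points in a 3-dimensional space
center = centroid(points)

# Distinct distances from each point to the mean, rounded to hide float noise.
distances_to_center = sorted({round(math.dist(p, center), 9) for p in points})

# Distinct distances between every pair of points.
pairwise = sorted({round(math.dist(a, b), 9) for a, b in combinations(points, 2)})

print(distances_to_center)
print(pairwise)
```

In this sketch every point sits at the same distance from the mean, while the pairwise distances take two values: one for points on different axes and a larger one for each pair of opposite points.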
We could also infer that the accusation is a jab at the fact that the data points are all over the place; a good example of such chaotic data can be seen in 1725: Linear Regression.
- [Cueball is standing on a podium pointing at his presentation, which includes a large line graph in the center. There is plenty of text on the presentation, but none of it is readable. The central part of the line is raised high above the left and right parts. The point where the line drops towards the right is highlighted with a circle, with a double arrow above it pointing to a caption. There is also text next to the circle to the right. Above the graph there are three smaller panels with drawings. There is one caption above these, and also one above the large graph. Below the graph there are two smaller panels with curves; each panel has its own caption. Cueball addresses an unseen audience, and someone in the audience interrupts him.]
- Cueball: The data clearly proves that-
- Offscreen voice: Are you Indiana Jones?
- Offscreen voice: Because you've got a lot of artifacts there, and I'm pretty sure you didn't handle them right.
Wouldn't data made entirely of outliers just be... regular measurements that just yield different results?#GoWest-West (talk) 13:59, 4 January 2017 (UTC)
One possibility for the alt-text scenario: Consider an n-dimensional dataset consisting of n points. Arbitrarily assign total orders to the data points and the dimensions. For the most part, every measurement is drawn from a standard Gaussian with mean 0 stdev 1, except the ith dimension of the ith point has a value of n. 22.214.171.124 (talk) (please sign your comments with ~~~~)
- Though this is a really fascinating idea, I think that it is not completely correct. You would need to define outliers in each dimension separately. If you use n-dimensional distance, the points will all be roughly equidistant from the mean. --126.96.36.199 10:42, 5 January 2017 (UTC)
- I think therefore that "One way to have a data set composed entirely of outliers would be a data set with N points, in an N-dimensional space, where each point is zero for every dimension except one, unique to itself. All these points are equidistant from each other." should be removed from the text. In an equidistant data set, no point is an outlier.--188.8.131.52 10:50, 5 January 2017 (UTC)
- Good point. I myself noted that in 1 Dimension, this is completely untrue, so I added a -1 point as well. Just saying, that was me. That's right, Jacky720 just signed this (talk | contribs) 16:07, 5 January 2017 (UTC)
The graph that Cueball is showing looks like the graph from the EM drive paper. Maybe Randall is poking fun at the EM drive with this comic? Cgplover (talk) 14:15, 4 January 2017 (UTC)
It does look like the Full Resonance tuner sweep graph 184.108.40.206 15:12, 4 January 2017 (UTC)
Why the emphasis on HAVE in the alttext instead of, say, ENTIRELY? 220.127.116.11 (talk) (please sign your comments with ~~~~)
- I see no issue with this. The speaker is clearly focusing on the probability of the situation. If anything, I'd say that this emphasis is intended to underline the competence, or lack thereof, of the researcher, which is in line with the mocking tone previously given. Not emphasizing HAVE would more indicate the speaker is accepting of the results, but is still surprised by them. 18.104.22.168 15:40, 4 January 2017 (UTC)
Is there also a suggestion that Indiana Jones didn't properly handle artifacts he dealt with? 22.214.171.124 (talk) (please sign your comments with ~~~~)
- Depends... Does dropping the Holy Grail down a crevice count as "not properly"? 126.96.36.199 15:40, 4 January 2017 (UTC)
- I also think that that could be a reference to him holding an artifact while running from that giant boulder. Could be. IDK. --JayRulesXKCD (talk) 15:58, 4 January 2017 (UTC)
I have the feeling that I've seen this comic before. Is there another comic where Cueball gives a presentation and is then dissed by his audience? 188.8.131.52 15:36, 4 January 2017 (UTC)
- I think you are referring to the one where he is talking about emoticons and parentheses (for example, :)), then gets kicked out of the convention center. --JayRulesXKCD (talk) 16:35, 4 January 2017 (UTC)
- Yeah, check out #410: Math Paper and #323: Ballmer Peak, see if those ring a bell. And as Jay mentions, there is also TED Talk. 184.108.40.206 20:02, 4 January 2017 (UTC)
To me, the point of the comic is the mistake in the first sentence. "Data" is plural and so the correct wording would have been "the data clearly prove that...". The last sentence points out the error -- there are lots of items on the poster and he didn't handle them correctly -- as a plural -- in the initial statement. The capitalization of HAVE also seems to be a clue that "plural" is the theme ("it has" versus "they have"). Ibid (talk) 16:19, 4 January 2017 (UTC)
- I'm pretty sure that argument has been addressed in a previous comic, or at least something similar. Linguistic drift changes the way words are used, and as long as the listener understands the speaker, there isn't really a reason to correct it. Also, it's more of a collective term than plural, which in American English use singular parts of speech. Plus, I'm of the camp that believes that loanwords should be treated as part of the language they are joining, rather than the one they are from. English is complicated enough with its Germanic, Greek, Latin, and specifically French components all contradicting each other on how they should be spelled and pronounced. --KingStarscream (talk) 16:50, 4 January 2017 (UTC)
- As far as the point of the comic being about him using the word incorrectly, that doesn't seem likely considering that the heckler talks about the data chart in the alt text as well. Using a word incorrectly wouldn't be considered an artifact, though the supposition about how it should be used can be in a way. As for the capitalization, it's for emphasis and sarcasm. --KingStarscream (talk) 17:03, 4 January 2017 (UTC)
- I don't think it's even relevant to quip on grammar in this explanation. Besides that, "data" here refers to the singular object of "collection of data", and as such I would think "the data proves" is most correct. --220.127.116.11 19:48, 4 January 2017 (UTC)
- Working in a field that uses lots of data and often uses the word "data" in formal publications, I concur with others that it is commonly and acceptably used as a "group noun" which is treated as singular. While datum is sometimes used as a technical term (I most often see it referencing a fixed line or plane used as a reference in geometry or Computer Aided Design), it is almost never used as the singular for "data." Whenever it begins to be tempting to treat it as plural and an editorial argument breaks out, I often recommend changing to "data point" or "data set" or similar for clarity. My point is that a grammatical debate here is pedantic, moot, and unrelated to the comic. 18.104.22.168 19:59, 4 January 2017 (UTC)
- Also we already know that Randall Munroe pokes fun at grammar pedants for this exact word from his comic "Data". 22.214.171.124 20:23, 4 January 2017 (UTC)
- Artifacts versus artifacts (artefacts?)
When I first read this I thought it was referencing image compression artifacts. Like he has a chunk of visual aid onscreen but it's all blocky and blurry and stuff. All the statistics stuff mentioned here didn't even cross my mind. 126.96.36.199 23:01, 4 January 2017 (UTC)
To whoever edited the title, topic OP here: artefact is the Brit spelling, artifact the North American one. As for me, I'm a Canada-Brit dual citizen who uses S's a lot ("stigmatised") but will miss the occasional Brittier spelling. 188.8.131.52 10:22, 5 January 2017 (UTC)
I also thought the comic was about JPEG Compression Artifacts! 184.108.40.206 02:32, 6 January 2017 (UTC)