1301: File Extensions

Title text: I have never been lied to by data in a .txt file which has been hand-aligned.

Explanation

Computer file names often end in file extensions like ".ppt" or ".exe". These extensions are a holdover from early operating systems like DOS, in which filenames had a maximum of eight characters followed by a period and a three-character extension. The operating system used the extension to determine the file type, so that it would know how to handle the file (e.g. which program could open it). Newer operating systems and file systems accept filenames longer than eight characters and extensions longer than three, although most extensions remain three characters long.
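As a rough illustration of the idea, the following Python sketch splits a filename into base and extension, checks whether it would fit the eight-plus-three pattern, and looks the extension up in a table. The table `HANDLERS` and the function names are made up for this example, and the 8.3 check is deliberately simplified (real DOS rules also restricted which characters were allowed).

```python
import os

# Hypothetical lookup table: extension -> what the system would do with it.
HANDLERS = {
    ".txt": "text editor",
    ".ppt": "presentation program",
    ".exe": "run as a program",
}

def split_extension(filename):
    """Return (base, extension), with the extension lowercased."""
    base, ext = os.path.splitext(filename)
    return base, ext.lower()

def fits_8_3(filename):
    """Simplified 8.3 check: at most 8 characters, a period, at most 3 characters."""
    base, ext = split_extension(filename)
    # ext includes the leading period, so 4 means a three-character extension.
    return 1 <= len(base) <= 8 and len(ext) <= 4 and "." not in base

def handler_for(filename):
    """Pick a handler purely from the extension, as the text describes."""
    _, ext = split_extension(filename)
    return HANDLERS.get(ext, "unknown")

print(fits_8_3("REPORT.TXT"), handler_for("REPORT.TXT"))
print(fits_8_3("presentation.pptx"), handler_for("presentation.pptx"))
```

Note that the lookup never inspects the file's contents, which is why an extension is only an indicator of what a file contains.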

Most extensions are created as proprietary to certain pieces of software, although software by other developers may later be designed to be able to read the format. For example, .doc is a Microsoft Word document, although because of that software's popularity, many word processors include the ability to open .doc files. Some common file extensions are not proprietary to a piece of software and may be handled by various programs; .jpg or .gif images are examples. In either case, a file's extension is generally a good indicator of what type of data the file contains.

Certain file types are more prevalent for certain uses: some are almost exclusive to one use, while others are in general use and might contain almost anything. Here, Randall presents a series of file extensions for files that often contain information, and rates the reliability of the information they generally contain, from most reliable to least.

  • .tex files are source files for the programs TeX and LaTeX, which are used often and almost exclusively by academics, especially in mathematics and the hard sciences. .tex pretty much means serious business, and Randall does not anticipate that anyone would use such a format other than for reliable information.
  • .pdf files are a portable (as in easily shared over the web) document format by Adobe, frequently used for publication. Companies use them for official documentation. Thus, a .pdf file is likely to be some type of final product or polished work. Further, .tex files are generally compiled into .pdf files in order to make them readable. It would be strange to trust a .tex file without trusting the .pdf to which it compiles. For example, when submitting to academic journals in math and the hard sciences, the journal accepts the .tex file, but then compiles it and publishes the resulting .pdf. On the other hand, software that can produce a .doc or .xls(x) file, as described below, now tends to have a built-in or add-on ability to "Export to PDF", with the promise of a file that is slightly more read-only and less affected by localisation than the .doc. A .pdf might therefore come, in good faith or otherwise, from a less professional editor trying to look a little more serious about the copy they distribute.
  • .csv files contain comma-separated values: tables of information delimited by commas. They often consist of computer-generated raw data (from, say, a scientific experiment or a database).
  • .txt files contain only plain text, no "rich text" or anything fancy. Programmers often use them for README files. The txt format indicates that the creator prioritizes recording the information over making the information visually appealing, although ASCII art images or multiline 'bannering' of text might be included by some authors.
  • .svg files are a (scalable) vector graphics format used a lot for diagrams, such as on Wikipedia.
  • .xls and .xlsx files are spreadsheets used and created by the program Microsoft Excel, part of a bundle of applications known as Microsoft Office (also supported by compatible free software such as LibreOffice). These applications are very commonly used, especially for business, finance and data analysis tasks. .xls is a binary format used for Excel versions up to 2003, while .xlsx is a ZIPped XML-based format used for Excel versions 2007 and later.
  • .doc files are a rich-text document format used and created by the program Microsoft Word, another application in the Microsoft Office bundle. As with .xls, almost anyone with access to Microsoft Office could easily make one of these. While Excel is generally used for creating tables and presenting data, Word could be used for any text-based document. Thus, Word documents tend to be far more prevalent and casually created than Excel documents, which is presumably why Randall doesn't trust them as much.
  • .png files are a bitmap image format designed for the Internet. They enjoy wide popularity for providing crisp, full-color images with lossless (reversible) compression. Almost all xkcd comics, this diagram included, use PNG. But, since anyone can create an image (you can draw something online and it will use .png), Randall rates this type as not very trustworthy.
  • .ppt files are used and created by the program Microsoft PowerPoint; as with the other two Office applications, almost anyone could easily make one of these. As they are usually used for presentations rather than documents, the information in them may be arranged differently, possibly to "dumb down" the content, or in marketing materials or talks in which the author may not be very objective. Further, several years ago, PowerPoint presentations were sometimes included instead of plain images as attachments in e-mail forwards containing inaccurate information. These emails still occasionally circulate, and may be the source of Randall's distrust.
  • .jpg files are another image format with high compression capabilities, good for storing photos and not so good for many other things. Photographs in general are prone to image manipulation, hence Randall's low score for this file format.
  • .jpeg files are the same thing as .jpg files, but these are more likely to have been created manually rather than automatically, making them even less reliable.
  • .gif files are yet another bitmap image format, notable for supporting short animations. GIF was once the dominant image file format on the Internet, until PNG gradually replaced it. Since GIF is the only common image format capable of animation, it is often used to contain things like silly clips of cats falling into boxes, or annoying, blinking advertisements claiming that "you're the 100,000,000th VISITOR!". GIFs are also created by Internet trolls, such as on 4chan.org, to feed misinformation to gullible gamers and other computer users. For example, a recent Xbox One hoax GIF contained instructions that were said to make the Xbox One backwards compatible with Xbox 360 games, but would actually make the console inoperable.

Note that while the extensions .xls/.xlsx, .doc, and .ppt were originally exclusive to Microsoft Office, a number of other programs, including open source suites such as OpenOffice and LibreOffice and some Android apps, are now capable of editing such files. These programs can run on systems other than Windows, such as Linux, perhaps making these formats even more widespread and easy to produce than before.

The title text refers to how .txt files contain only plain text and nothing else, meaning that any alignment (such as for indentation, tables, or justification) would have to be performed manually by adding in spaces or tabs. Anyone who would go through such an effort to improve their text's readability is likely to be trustworthy, and almost by definition, the opinion presented would be justified.
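To show what "hand-aligning" involves, here is a small Python sketch that does the same job automatically: it pads every cell of a comma-separated table with spaces so that the columns line up in a plain .txt file. The function name `align_table` is invented for this example, and the two scores are the arbitrary estimates from the transcript, not real data.

```python
import csv
import io

def align_table(csv_text):
    """Pad each column with spaces so the table lines up in a monospaced font."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    n_cols = max(len(row) for row in rows)
    # Give short rows empty cells so every row has the same number of columns.
    rows = [row + [""] * (n_cols - len(row)) for row in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(n_cols)]
    lines = []
    for row in rows:
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append("  ".join(cells).rstrip())
    return "\n".join(lines)

sample = "extension,score\n.tex,100\n.gif,-36\n"
print(align_table(sample))
```

Someone aligning a file by hand performs the same steps with the space bar, which is exactly the effort the title text treats as a sign of good faith.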

Transcript

[Caption above the bar chart:]
Trustworthiness of Information by File Extension
[A line is going down and from that gray bars charting the trustworthiness in a bar graph that goes both left and right of the line. No units or figures are given. For ease of comprehension this transcript will arbitrarily designate the highest score as [+100]; subsequent scores are estimates based on the size of their bars.]
[+100]: .tex
[+89]: .pdf
[+85]: .csv
[+67]: .txt
[+65]: .svg
[+49]: .xls/.xlsx
[+21]: .doc
[+15]: .png
[+14]: .ppt
[+3]: .jpg
[-8]: .jpeg
[-36]: .gif

Trivia

The various extensions are, for the most part, abbreviations of the file type.

  • .tex isn't short for anything; TeX (that lowercase e is very important) is in fact the full name of the program
  • .pdf is an acronym for Portable Document Format
  • .csv is an acronym for Comma-Separated Values
  • .txt is short for "text" - the 8.3 format meant the vowel was dropped
  • .svg is an acronym for Scalable Vector Graphics
  • .xls is short for eXceL Sheet (it's also why Microsoft Excel has an "X" on its icon rather than an "E")
  • The extra x in .xlsx (.docx and .pptx) refers to the upgrade from binary to ZIPped XML for those formats
  • .doc is short for DOCument
  • .ppt is short for PowerPoinT presentation
  • .png is an acronym for Portable Network Graphics
  • .jpg is short for .jpeg - the 8.3 format again removed the vowel
  • .jpeg is an acronym for Joint Photographic Experts Group, the organization that created the standard
  • .gif is an acronym for Graphics Interchange Format
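Because the newer Office formats are described above as ZIPped XML, a file claiming to be .xlsx, .docx or .pptx should at least open as a ZIP archive. The Python sketch below checks this with the standard `zipfile` module. The helper names are invented for this example, and the archive built here is only a stand-in with one made-up member, not a valid workbook.

```python
import io
import zipfile

def looks_zipped(data):
    """True if the bytes can be read as a ZIP archive."""
    return zipfile.is_zipfile(io.BytesIO(data))

def member_names(data):
    """List the files stored inside the archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()

# Build a stand-in archive in memory (hypothetical member name and content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example/part.xml", "<root/>")
stand_in = buffer.getvalue()

print(looks_zipped(stand_in), member_names(stand_in))
print(looks_zipped(b"plain text, not an archive"))
```

An older binary .xls or .doc file would fail this check, which is one quick way to tell the two generations of formats apart regardless of the extension.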



Discussion

Tex is a Turing complete language so when it compiles to a PDF it could hide malicious code. 141.101.98.154 (talk) 13:16, 29 September 2017 (UTC) (please sign your comments with ~~~~)

No, that's wrong! Turing-completeness refers to what calculations a system can perform; it doesn't say anything about how an implementation of that system can interact with another system that hosts it. Conway's Game of Life is Turing-complete, but you'd never imagine a Life board to be an attack vector. (Now, if your TeX compiler has a vulnerability, that's another issue.) 162.158.74.87 13:18, 19 August 2019 (UTC)

The title text reference of "hand-aligned data" may refer to ASCII art. 108.162.215.28 05:36, 9 December 2013 (UTC) Alan K.

I'd think not, given that art isn't exactly data. My guess would be tables in the .txt - a .txt file is just raw text with no formatting, so putting a table in requires manually formatting it with a bunch of spaces/tabs. It's not hard, but can be time-consuming and obnoxious. 108.162.219.47 23:57, 10 December 2013 (UTC)
Any programmer would tell you to never try and hand-align things with tabs. Different text editors will use anything from 3 to 8 spaces for a tab, meaning that what's aligned in your editor isn't in others. 108.162.236.13 (talk) 14:09, 21 December 2013 (UTC) (please sign your comments with ~~~~)
Indent with tabs; align with spaces. More formally, tabs should only be at the beginning of a line, and should have a strong contextual relationship with the surrounding text. This is a reasonably thought out explanation: http://lea.verou.me/2012/01/why-tabs-are-clearly-superior/ 199.27.128.67 17:18, 19 February 2014 (UTC)
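A quick way to see the problem described in the two comments above is to expand the same tab-aligned lines with different tab widths. This Python sketch uses the built-in `str.expandtabs`; the two sample lines and the helper `value_column` are made up for illustration.

```python
# Two lines "aligned" with tabs by someone whose editor uses 4-column tabs.
lines = ["gif\t\t-36", "jpeg\t-8"]

def value_column(line, tabsize):
    """Column at which the last tab-separated field starts once tabs are expanded."""
    expanded = line.expandtabs(tabsize)
    last_field = line.split("\t")[-1]
    return len(expanded) - len(last_field)

for tabsize in (4, 8):
    columns = [value_column(line, tabsize) for line in lines]
    print(tabsize, columns)
    for line in lines:
        print(line.expandtabs(tabsize))
```

With a tab width of 4 both values start in the same column; with a width of 8 they no longer do, which is why aligning with spaces is the safer choice.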

I think it's also a notable point that the better-rated document formats are more data-centric, while the low-rated formats mix textual information with design elements and finally become pure graphics formats, which is often an indication that the author didn't use the appropriate file type for (mostly) pure textual information.

Something I don't understand is the gap between jpg and jpeg. The first suffix is AFAIK only an abbreviation used by older DOS/MS systems to fulfill the 8.3 limitation for filenames. The note about hand alignment might concern the fact that hand alignment takes more time, which might increase the amount of time the author spends thinking over the content before laying it out. Also, automated layout, as supported by many modern writing applications, often leads to unexpected and sometimes wrong results, because the automation has no semantic knowledge of the author's intention, which might introduce errors in post-processing.

Sorry for my bad english, I'm not a natural writer 108.162.231.239 05:45, 9 December 2013 (UTC)

"hand-aligned data" seems to me like (manually) space-indented paragraphs, perhaps even manual padding to achieve the desired justification (centering and right-and-left-margin-hugging). And of course neatly lining up an 'embedded table', perhaps originally extracted from a .csv output. Although a number of plain-text editors (in the days of CGA and pure terminal/fixedspace fonts) or text formatters and wrappers (e.g. Lynx, man-page creaters, etc) would do things like this for you. And still do. At least insofar as the justification and margining is concerned. 141.101.99.229 08:35, 9 December 2013 (UTC)
If anyone has taken the time to hand align a text file (as in a README, or other info file), they want it to look attractive for people to read. Odds are you're not going to take the time to "hand pretty" the document just to be malicious. Back in the BBS days there were a large number of "online" groups who had "signature" text files which were (very probably) hand aligned, and made extensive use of extended ASCII codes to generate basic graphics. (Granted there were programs to help auto-generate "ascii art".) If you've ever seen these files you'd know. Example 1 - Example 2 Jarod997 (talk) 14:14, 9 December 2013 (UTC)
I thought hand aligned meant an image of a .txt file that was hand-rotated, kinda like the .NORM file joke.172.70.110.65 18:00, 19 March 2022 (UTC)Bumpf

I find it interesting that .jpg and .jpeg are at different levels. Aren't those the same thing? --Mralext20 (talk) 05:48, 9 December 2013 (UTC)

Perhaps the .gif could contain suddenly unexpected scary/surprising frames? 108.162.208.172 14:54, 9 December 2013 (UTC)
That JPG/JPEG thing indeed seems strange. The more important distinction is between JPEGs that are photographs (fine) and those that are not (stupid). Also, pre-PNG, non-photograph GIFs could be just fine. And with all the accounting scandals we've seen, why would those spreadsheet formats get any credibility? -- Dfeuer (talk) 06:06, 9 December 2013 (UTC)
Alongside .jpeg ('full' extension format) and .jpg (MS '8.3'-compatible extension format), I'd have expected .jpe (often full extension historically truncated on an 8.3 system), I must be honest. (And interesting that .docx doesn't co-inhabit the .doc line... or be somewhere else.) And the disparity betwixt the two versions of JPEG extension may relate to the tendency for a higher artefact-intensity of images back in the early days (when a better option than GIFs for... certain pictures... e.g. on Usenet between *nix workstations with vastly restricted bandwidths and storage capacities) compared to today's users (cameras that regularly store 10+MP pictures in low-loss JFIF files, and/or in Raw format!). But that may be a spurious or off-track reasoning on my part. 141.101.99.229 08:27, 9 December 2013 (UTC)

I measured the bars in photoshop to +/- 2pixels. If we scale .tex to a value of 100 like the transcript says, these are the values I get for the bar lengths (rounded to one decimal place)

  • .tex: 100
  • .pdf: 89.4
  • .csv: 84.9
  • .txt: 66.5
  • .svg: 64.8
  • .xls: 48.6
  • .doc: 21.2
  • .png: 15.1
  • .ppt: 14.5
  • .jpg: 3.4
  • .jpeg: -8.4
  • .gif: -35.8

Dunno if it is helpful - or even trusted given I'm a first time commenter - but there it is. Closer values than just estimating, though the eyeballed estimates aren't bad. Not going to adjust the actual transcript because I feel that's overstepping my bounds. 108.162.216.56 (talk) 07:34, 9 December 2013 (UTC) (please sign your comments with ~~~~)

Not at all, wikis are free to edit for a reason. If we didn't want new users to be editing pages, we could have turned that off long ago. Davidy²²[talk] 07:55, 9 December 2013 (UTC)

As the information that is provided by the graph comes as png, we should probably not trust it. --141.101.92.120 09:03, 9 December 2013 (UTC)

Ha, +1 Like :-) Spongebog (talk) 14:02, 9 December 2013 (UTC)

I never saw an image of cute cats lying to me ... I mean, the gif is STILL the preferred format for animation, mostly because it's the only one supported. Animation formats based on PNG never caught on, hard to say why ... on the other hand, gif animations apparently have a huge number of weird extensions, judging by the number of animated images I found which don't render properly in anything EXCEPT the browser. -- Hkmaly (talk) 10:27, 9 December 2013 (UTC)

The cute cat may not be lying, but since the format is used in other context -- like banner ads, then the average GIF may well be lying, also I believe there have been many security issues with GIFs and JPGs as they have been used as an attack vector for internet-bad-guys to take over your computer -- so while security issues is not specifically the topic for todays strip, then that may be worth noticing as well Spongebog (talk) 14:02, 9 December 2013 (UTC)
It is also possible to create animations with svg, which is (for good reason, I like that format) ranked higher. Especially for scientific purposes it can be handy. Unfortunately, the MediaWiki software is unable to show them. For example, an animation of the Galilean moons is shown in the previous comic. That is a gif, but someone also uploaded an svg animation, and I would say it looks smoother than the gif. 108.162.231.215 14:40, 9 December 2013 (UTC)
The Grumpy Cat is not grumpy in real life - so cat pictures DO lie! Schmammel (talk) 15:40, 9 December 2013 (UTC)

What is the scale of the chart? Does 'top' = most trusted'? Never assume anything with xkcd. David.windsor (talk) 18:29, 9 December 2013 (UTC)

Brilliant. I didn't think of that at all. But now that you mention it... a .gif would be like a small part of a video. And people tend to trust those more than a static picture. 108.162.222.209 08:58, 13 December 2013 (UTC)

Of course Randall does not really think that the file extension determines trustworthiness; the graph is tongue-in-cheek. Information can be trustworthy or untrustworthy no matter the format it's given in. 108.162.216.221 18:50, 9 December 2013 (UTC)

Yes, I believe the explanation somewhat misinterprets Randall's intentions, especially when it comes to the image formats. I interpret it not as a question of loss of information due to compression but instead a more general impression of when and by whom these formats are used and, as a consequence, the trustworthiness of the information conveyed through these formats. That would explain the jpg/jpeg distinction as (in my experience though I can't provide data that support it) .jpg is nowadays the preferred compressed format in professional contexts and .jpeg looks slightly childish. 141.101.80.117 23:59, 9 December 2013 (UTC)

Reading more into the linked info about viruses embedded in JPEGs, it appears that the only way to receive a virus from a JPEG file would be to have already received another virus from a standard executable file, where such a virus causes the computer to execute code in a JPEG file rather than simply display it as it normally would. Since such a possibility is independent of the file type (the first virus might just as well have enabled code execution in DOC files, for instance), I've removed that bit of info. Zowayix (talk) 03:44, 10 December 2013 (UTC)

Can anyone explain the banner near the top of xkcd.com today, 10 Dec 2013? It reads, Dear Wikipedia readers: if everyone reading this showed up at my house, I would be like "what?" 108.162.219.220 (talk) 15:53, 10 December 2013 (UTC) (please sign your comments with ~~~~)

I believe that is a reference to the similar banner that is on top of wikipedia right now asking for donations. --Jeff (talk) 18:02, 10 December 2013 (UTC)
I don't see that banner, but it appears to be a play on Wikipedia's donation "pleas" that are often posted (including now) as banners at the top of Wikipedia which suggest that (to use the latest one:) "If everyone reading this donated, our fundraiser would be done within an hour". TheHYPO (talk) 18:05, 10 December 2013 (UTC)

I think it's a bit ambiguous whether Randall's references (for example) to jpg and gif means he doesn't trust that the images are accurate because of artifacting and stuff, or whether he's referring to jpgs and gifs that occasionally circulate with text on them as if to present information (e.g., lifehack images, or cat memes...) TheHYPO (talk) 18:05, 10 December 2013 (UTC)

Missing suffixes

Obviously .html & .htm are so far to the left, they're off the chart. :-) 108.162.249.117 17:43, 10 December 2013 (UTC)

Any idea what file type was used to spread this hoax? http://www.dailydot.com/lifestyle/apple-secret-bitcoin-mining-feature/ Various websites reporting on it use .JPG and .PNG, but I don't know what format the original graphic was. InspectorClouseau (talk) 16:16, 17 December 2013 (UTC)

I'd be pretty wary of .flv... Nick Douglas (talk) 15:16, 18 December 2013 (UTC)

I don't completely agree with ".png"'s explanation: "But since he rates the format so low, is Randall saying we shouldn't trust this chart?" I think it's being seen from the wrong perspective. In my opinion, ".png" is rated low due to being less capable and less commonly used to transmit trustworthy information than those rated higher. What do you all think? If you agree with me, please edit it, as I will not monitor this page.

I also think that ".tex"'s explanation is lacking. It should be said it's a way to format text documents using programming, in order to make them better looking and easier (for some) to write and format.

Plus, I generally disagree with a lot of what is said about file extensions, since our whole operating systems could work just fine if all extensions disappeared (provided that programs look for the right files by name only, and maybe a few more folders were created). But that's my own opinion, and not something to be added here. 108.162.219.125 02:57, 6 February 2015 (UTC)

One thing that seems to be overlooked here, is that GIFs are probably the least trustworthy because they can have those pop out horror images that scare you when you think you are just looking at a normal picture. 199.27.128.120 17:32, 20 April 2015 (UTC)

Funny story related to the trustworthiness of files, The other day a friend asked me "How do you make a jpg with transparency?" I said, "you can't." He sent me the file, sure enough it looked like a .jpg with transparency, it opened in windows pictures, in chrome, in firefox, however it wouldn't load in Gimp and it wouldn't load in Photoshop. I popped it into a file analyzer and it registered as a gif! So, yeah, gifs are pretty shady... 172.68.58.107 16:42, 11 September 2017 (UTC) Sam
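What a "file analyzer" does in a case like this can be approximated by looking at the first few bytes of the file instead of its name. The Python sketch below checks the well-known starting signatures of GIF, JPEG and PNG files. The function names and the sample bytes are invented for this example, and it is only a minimal check, not a full format validator.

```python
import os

# Starting bytes ("magic numbers") of three common image formats.
SIGNATURES = [
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
]

def sniff_format(data):
    """Guess the image format from the leading bytes, or None if unrecognised."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None

def extension_lies(filename, data):
    """True if the content is a recognised format that differs from the extension."""
    claimed = os.path.splitext(filename)[1].lower().lstrip(".")
    claimed = {"jpg": "jpeg"}.get(claimed, claimed)  # treat .jpg and .jpeg alike
    actual = sniff_format(data)
    return actual is not None and actual != claimed

# Made-up sample: GIF header bytes saved under a .jpg name.
disguised = b"GIF89a" + b"\x00" * 16
print(sniff_format(disguised), extension_lies("friend.jpg", disguised))
```

Programs that ignore the extension and go by the leading bytes would open such a file happily, which may explain why it displayed in some programs and not in others.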


Here's an entire Whitepaper written in a .txt file. If this isn't what this comic is trying to explain, I don't know what. http://www.linux-kvm.org/downloads/lersek/ovmf-whitepaper-c770f8c.txt?fbclid=IwAR1JnAtCs5syKoF70I0d-KnZpI3BnsceIRrCDgevCGrbVSejVThaKNlHDc0 172.69.44.152 (talk) 16:38, 15 June 2019 (UTC) (please sign your comments with ~~~~)