2109: Invisible Formatting

Title text: To avoid errors like this, we render all text and pipe it through OCR before processing, fixing a handful of irregular bugs by burying them beneath a smooth, uniform layer of bugs.

Explanation

Most word processor programs allow the user to select sections of text, usually by clicking and dragging the cursor across the text, or by using common mouse shortcuts such as double-clicking to select a word and triple-clicking to select an entire line. The selection is usually indicated by highlighting the text's background, such as the bright blue highlight shown in the comic.

A common reason for selecting text is to apply formatting to the selection (e.g. italics or bold). Since space characters are part of the typography, such formatting gets applied to them too; but because a space has no visible glyph, the formatting has no visible effect (a bold space looks exactly the same as an unformatted space). The formatting is nevertheless still there in the document's underlying markup; it just can't be seen.

This leads to a possibility that a user may accidentally introduce invisible formatting into a document without noticing. Such formatting has no effect on how the end user will read the document, but it could theoretically cause problems for programs that later come along to parse the document, if those programs have not been told to expect formatting. Randall worries about this invisible threat.

In the comic, Randall accidentally introduces the invisible formatting by selecting one more character than he needed to ("n", "o", "t", and an extra space character), applying bold formatting to those four characters, changing his mind, reselecting only the three characters "n", "o", "t", and removing the bold formatting. Because he failed to notice that a space character had been selected when applying the bold, he failed to remove the bold formatting from the space. As a result, the document now contains an invisible bold space that will likely go unchecked, as nobody can see it to fix it.
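
The sequence of actions in the comic can be sketched in a few lines of Python. This is only an illustrative model, not how any particular word processor stores text: the document is assumed to be a list of (character, is_bold) pairs, and the helper names `make_doc`, `set_bold` and `hidden_bold_whitespace` are made up for this example.

```python
# Hypothetical model: a document is a list of (character, is_bold) pairs.
def make_doc(text):
    return [(ch, False) for ch in text]

def set_bold(doc, start, end, bold):
    """Return a copy of doc with the bold flag set on characters in [start, end)."""
    return [(ch, bold if start <= i < end else b)
            for i, (ch, b) in enumerate(doc)]

def hidden_bold_whitespace(doc):
    """Indices of whitespace characters that still carry bold formatting."""
    return [i for i, (ch, b) in enumerate(doc) if b and ch.isspace()]

text = "but would not have to"
doc = make_doc(text)
start = text.index("not")

# Select "not " (three letters plus the trailing space) and click bold.
doc = set_bold(doc, start, start + 4, True)
# Change of mind: reselect only "not" and click bold again to remove it.
doc = set_bold(doc, start, start + 3, False)

print(hidden_bold_whitespace(doc))  # prints [13]: the space after "not"
```

The visible text is unchanged, yet one character still has its bold flag set, which is exactly the situation the arrow in the last panel points at.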

There are a couple of ways Randall could have avoided this problem. In many word processors, double-clicking a word will select all characters in the word and nothing else; this is an easier action than trying to drag the cursor, which can be fiddly and inaccurate. This would have prevented Randall from accidentally selecting the space character. The same problem could still arise, however, if several words (and the spaces between them) were bolded together and the words were later unbolded one at a time, leaving the whitespace between them bold. Alternatively, if the program had an "undo" feature, Randall could simply have undone the bold formatting instead of removing it manually. This would have removed the bold formatting from the space character as well, fixing the problem (and saving time, too), but only if he had not made other changes in the interim that he wanted to keep, since undoing would discard those as well.

Though Randall is likely thinking of computer-related problems caused by his invisible formatting, there is also another possible problem: it leaves trace evidence of Randall's formatting attempt. For example, if an editor later comes along and notices the bold space, they may figure out that Randall originally bolded the word "not" before changing his mind. Depending on the context, a bolded "not" could be enough to change the tone of the text from polite and formal to dismissive (e.g. "We believe you are not suitable for this position." vs. the same sentence with the word "not" in bold).

In the title text, Randall says that he fixes such invisible formatting errors by running the text through OCR, which turns images into text. Since OCR uses optical recognition, it would not be able to detect the invisible formatting and would therefore not reproduce it. Although this would "fix" the invisible formatting issue, it would likely introduce a bigger problem: OCR is not 100% reliable at recognizing characters or formatting, and often produces inaccurate results. However, Randall facetiously suggests that this is a preferable state of affairs, as OCR at least produces errors at a reasonably consistent rate, which Randall feels is better than irregular invisible formatting errors.

As the title text suggests, Randall finds it very important to control all the information he publishes. Real-world examples include governments altering reports to change their impact for political reasons; a leftover bold space can reveal attempted tampering of this kind. Another example would be a short, casual one-sentence reply, e.g. to a romantic interest, which one spends an hour formulating so that it sounds as natural as possible.

There are also other occasions where a hidden bold space may be a problem for later editors (see the Trivia section below). Randall’s background in computer programming could also make him more attentive to these types of technical problems, and therefore add this as a reason for his worries about invisible formatting.

Transcript

[A text editor, with some options. They are superscript in one section, bold, italic and underscore in another section and alignments in the third section. The word "not ", including the following space, is highlighted in blue. There is a cursor below it.]
Text: ...ere, but would not have to mo...
Action: Select
[The cursor is on the "bold" option and the selected word is bolded.]
Text: ...ere, but would not have to mo...
Action: Click
[The cursor is next to the "to". No text is highlighted.]
Thought bubble: ...Nah, the bold is too much.
Text: ...ere, but would not have to mo...
[The word "not" is now highlighted in blue again, but the following space is not.]
Text: ...ere, but would not have to mo...
Action: Select
[The cursor is on the "bold" option and the selected word is not bolded.]
Text: ...ere, but would not have to mo...
Action: Click
[The cursor and the blue highlighting are gone. The space after "not" has a dashed box around it, and an arrow points to it.]
Text: ...ere, but would not have to mo...
Arrow: Hidden bold space
[Caption below the panels:]
When editing text, in the back of my mind I always worry that I'm adding invisible formatting that will somehow cause a problem in the distant future.

Trivia

There are also other occasions where a hidden bold space may be a problem for later editors etc. These include:

  • Editing that adds some text at the location of the space will make this text bold.
  • Exporting to plain text files. If for example a markdown style is used, there will be characters in the output that do not make sense.
  • Scraping, data mining, and linguistics processing by computer algorithms. Often (although not always) these algorithms are written based on samples of training or testing text that may not have spurious formatting present, and may misprocess something when encountering the spurious formatting.
  • Wikis. In this sentence, every space is a hidden bold space. From the editing view, all the spaces look like''' '''this. This will annoy all future editors of this article, due to the hidden apostrophes which are formatting the spaces. They may also accidentally introduce bold words.
    • By default, MediaWiki attempts to prevent this by not including the trailing spaces in the bold formatting when you click the “bold” button, so someone has to manually type the formatting apostrophes to do this.
  • Submitting text where formatting is not allowed. If the user strips the formatting from the words but not from the spaces, the leftover formatting may be noticed and the text rejected.
  • If a font's word space differs between the bold and the regular weight, perhaps so that bold words are spaced closer to each other, the spacing will look inconsistent wherever there is a hidden bold space.
  • Unnecessary extra formatting will usually unnecessarily increase file size, which may put the document above some maximum file size threshold.
  • Bold (or italic or non-breaking) spaces are also popular in steganography. By using bold spaces in some places and not in others it is possible to hide secret information in a public text, that will not be visible to the casual reader, who does not explicitly search for the hidden information. Additionally if such a document is found with a person, that person can plausibly deny all knowledge of the encoded information.
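
The plain-text export problem from the list above can be sketched as follows. The sketch assumes a document stored as a list of (text, is_bold) runs, which is a hypothetical representation; the function names `to_markdown_naive` and `to_markdown_careful` are invented for the example and do not belong to any real converter.

```python
def to_markdown_naive(runs):
    """Wrap every bold run in ** markers, whatever it contains."""
    return "".join(f"**{text}**" if bold else text for text, bold in runs)

def to_markdown_careful(runs):
    """Keep leading and trailing whitespace outside the ** markers."""
    out = []
    for text, bold in runs:
        core = text.strip()
        if not bold or not core:
            # Plain run, or a bold run that is only whitespace: emit as-is.
            out.append(text)
            continue
        lead = text[: len(text) - len(text.lstrip())]
        trail = text[len(text.rstrip()):]
        out.append(f"{lead}**{core}**{trail}")
    return "".join(out)

# The comic's sentence fragment, with the hidden bold space as its own run.
runs = [("but would not", False), (" ", True), ("have to", False)]

naive = to_markdown_naive(runs)      # "but would not** **have to"
careful = to_markdown_careful(runs)  # "but would not have to"
print(naive)
print(careful)
```

The naive exporter turns the invisible bold space into four visible asterisks, while the careful one drops formatting that cannot be seen. Whether silently dropping it is the right choice depends on the use case; it would, for instance, also erase the steganographic use of bold spaces described in the list.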

Popular modern word processing programs have features which may make it easier to notice improperly formatted invisible characters. In the tutorials linked here, one may learn how to view invisible characters in Microsoft Word, Pages and LibreOffice Writer. However, even with this feature on, it would be difficult to spot a bolded space, which is displayed as a bolded dot: now visible, but so small that it's still hard to tell whether it's bold or not. In the older word processor WordPerfect, one could do this with the “Reveal Codes” feature, which showed the formatting codes, separate from the characters themselves, around the characters they affected. For example, a bolded space would look something like "[BOLD≻≺BOLD]".

Web sites which allow content to be edited by users but generate the formatting code automatically often have versions of the invisible formatting problem; for example, eBay listings which use anything other than the default font rapidly accumulate hard spaces, font end and begin transitions, and other invisible formatting if they are subsequently edited, which can slow page loading and cause other problems. This is also seen in blogs etc.
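
One way an editor could limit this kind of accumulation is to normalize its formatting runs after each edit. The sketch below is only an assumption about how such a cleanup might look, not a description of what eBay or any blog platform actually does; the run representation (a list of (text, format) pairs) and the name `merge_runs` are hypothetical.

```python
from itertools import groupby

def merge_runs(runs):
    """Drop empty runs, then merge adjacent runs that share the same format."""
    non_empty = [(text, fmt) for text, fmt in runs if text]
    return [("".join(t for t, _ in group), fmt)
            for fmt, group in groupby(non_empty, key=lambda run: run[1])]

# Runs as they might look after several rounds of editing.
runs = [
    ("Vintage", "plain"),
    ("", "bold"),        # empty run left behind by a deleted word
    (" ", "plain"),
    ("lamp", "plain"),
    (" for", "bold"),
    (" sale", "bold"),
]

merged = merge_runs(runs)
print(merged)  # [('Vintage lamp', 'plain'), (' for sale', 'bold')]
```

Note that this only removes redundant transitions. A bold space sitting between two plain runs is a genuine format change as far as the data is concerned, so it survives the merge and would need a separate rule.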



Discussion

This reminds me of the person who used l (lower-case "L") instead of 1 for data entry at some business. Amazingly, the computer accepted it (BAD programming!) and it wasn't found out until the end of the tax year, when all heck broke loose! 162.158.75.136 14:50, 8 February 2019 (UTC)

Some programming puzzles are often solved with stuff like this: AΑ Fabian42 (talk) 15:19, 8 February 2019 (UTC)
"l" (lower-case "L") is a valid suffix to integer literals in C and derived languages. It indicates the number is of the "long int" type as opposed to a plain "int". Because C automatically upconverts the "int" type into "long int" when needed, the "l" suffix is rarely used. The result: "long int a = 1;" and "long int a = 1l;" mean exactly the same thing, and both statements are perfectly standard and won't raise any warning from compilers. "ll" (double el) is also a valid suffix, this time for the "long long int" type. GuB (talk) 15:39, 8 February 2019 (UTC)
Typing lowercase L instead of 1 is a common thing for people of a certain age. Old manual typewriters usually don't have a "1" key, so people learned to use lowercase L instead -- and sometimes slip back into that habit on newer technology. --Aaron of Mpls (talk) 02:03, 9 February 2019 (UTC)
That's exactly what happened in my example. I blame the programmer, though, for allowing a letter where a numeral was required or possibly converting the l to a 1 if the programmer knew such a thing ever happened. In either case, it shouldn't have allowed the l to just sit there like a bomb waiting to blow apart the post-tax-year processing. 172.68.58.83 15:22, 9 February 2019 (UTC)

I went to this page, expecting it to be self-referential. Was not disappointed. Fabian42 (talk) 15:19, 8 February 2019 (UTC)

Some markup conversion tools don't handle hidden bold spaces correctly. This HTML to Markdown converter is an example: https://anthonychu.github.io/to-markdown/ It converts <b>a </b> to **a ** instead of **a** . 172.69.62.10 15:40, 8 February 2019 (UTC)

Hah, this comment is not mine! Somehow I have your IP now. 172.69.62.10 17:47, 8 February 2019 (UTC)

Were the periods in the beginning there for a specific reason? Netherin5 (talk) 17:42, 8 February 2019 (UTC)

The user 108.162.245.16 thought it was a good idea for some reason. Glad you fixed it. I finished the job 172.69.62.10 17:46, 8 February 2019 (UTC)

I've had this happen when writing papers. Bold. Unbold. Later backspace into the hidden bold space and everything typed after gets put in bold. If a professor gives you a page count instead of a word count, you can make the punctuation in your paper bold (or increase the font) to add some extra padding that might go unnoticed. Don't actually do this if you can't convey your thesis in fewer words. 172.69.210.52 18:11, 8 February 2019 (UTC)

I hated when Microsoft Word took over and lacked a real "Reveal Codes" like WordPerfect used to have. I'm kind of like Randall, I think about those behind-the-scenes things that lots of companies like to try to hide from the user, and I like the power to do something about them. -boB (talk) 18:58, 8 February 2019 (UTC)

When I saw the strip, I immediately thought of WordPerfect because of its brain-dead way of inserting formatting as special codes inline with the text. Hit "reveal codes" and it would reveal a string of bold on / bold off codes because it wasn't clever enough to optimise them away. I assume Word does it differently, perhaps with attributed strings, and so doesn't need the reveal codes function so you can manually fix the mess the program has made.

In Microsoft Word, where the majority of people would have experience with selecting and bolding text, the cursor appears as an "I-beam" when positioned over text and not as the "mouse pointer arrow" shown by Randall. Also, in Word double-clicking a word does select the following space(s), but when bold is applied it is applied only to the selected word, NOT to the trailing space (even though the space was selected when the bold was applied). So selecting just the word and un-bolding would not leave a bolded space behind, since the space was never bolded. Clearly Randall's example is in some editor other than Word. Since Word is where most people have familiarity with selecting and bolding text, something should be added to the explanation noting this and speculating on which text editor Randall is actually showing. - 108.162.246.215 20:35, 8 February 2019 (UTC)

Agreed. Most text editors do not select the trailing space when double-clicking. Microsoft Word is one of the few that does it. But in that case, the space is not formatted as bold. But in most word processors including Word, if you do select the word with the trailing space and apply the bold formatting, the space retains the formatting even if the word is un-bolded. So the first sentence of the explanation is incorrect.
Do they not? Notepad does it. Notepad++ does it. Your browser does it. Where is the wealth of programs that don't? I reckon this is the default system-wide behavior for double-clicking in Windows, regardless of program. 172.68.65.228 11:46, 9 February 2019 (UTC)
It seems to be indeed Windows issue, as everything I tried did highlight extra space (except Notepad++), but nothing I tried on Linux did. 162.158.90.36 13:59, 9 February 2019 (UTC)

Hidden formatting annoys translators greatly. Sometimes, the formatting of the word processor used and the formatting recognized by the CAT program (such as SDL Trados Studio or MemoQ) do not line up very well, which causes the formatting to appear as tags within the text (purple colored in the most widely used CAT software, Trados). If there is sloppy or hidden formatting all through the document, this turns into what most people call a "wall of purple", with tags everywhere within the document. Since tags need to be accounted for (otherwise the document does not save properly), and the formatting capability of most CAT tools is a lot more limited compared to any word processors, this is a colossal waste of time for any translator to wade through. Thus, if you leave any hidden formatting in a document and you know it will be translated somewhere down the line, you know there is a translator out there that curses the day you were born. (A note though - PDF conversion is responsible for a lot more wall of purple incidents than sloppy formatting. Seriously - if you expect a document to be translated at some point, never bring it anywhere close to the PDF format. That format is evil, I tell you. Pure evil.) 162.158.89.61 05:47, 9 February 2019 (UTC)

In WordPerfect for DOS, the codes were [BOLD] to turn bold on and [bold] to turn it off again. --162.158.38.40 11:30, 9 February 2019 (UTC)

The whole idea of invisible formatting is being used by some websites, including Facebook, to make it much harder for ad blockers to block ads. For example, https://twitter.com/themikepan/status/1093035372186034176 Of course, the same can also be used to defeat swear filters on forums, as well (which, for some words like "bastard sword," *the moderators* themselves suggest doing). Draco18s (talk) 19:43, 9 February 2019 (UTC)

We have a category for comics with colour... can we have a category for comics with lowercase letters? :) Undergroundmonorail (talk) 02:33, 10 February 2019 (UTC)

I frequently see a similar, related problem. In preparing a weekly newsletter (consisting mostly of links to articles from various news sources), people submitting articles to me usually send me Microsoft Word files into which they have used copy/paste to insert the headline, URL and a few lines of text for context. On far too many articles, I find that the resulting text has embedded UNICODE Left-to-right mark characters (U+200E) in it. These don't affect display and printing at all (since all of the text is already left-to-right), but it creates broken links if one appears in a URL and I copy/paste it into a web browser's location bar. There doesn't seem to be any way to make these characters visible in Word. If manually cursoring over the text (with left/right keys), you will see the cursor change shape without moving when stepping over the left-to-right mark, but that's the only indication. It's quite annoying to have to work around. (If anyone knows of a good workaround, please let me know.) Shamino (talk) 19:32, 10 February 2019 (UTC)

I frequently cut-and-paste text into Notepad (or gedit, or some other text-only editor etc.), then cut-and-paste it back to Word or whatever other "rich text" capable destination I am using -- this removes all hidden junk, formatting, font changes, bold, etc. and the pasted text takes on the characteristics of wherever it's pasted into rather than where it came from. This is basically taking the text down to the bare minimum, and then I can reintroduce whatever formatting I want it to have. -boB (talk) 16:47, 11 February 2019 (UTC)

GIMP is really bad about this when trying to add text to an image. You either end up with the formatting not wanting to stick, or you end up with invisible formatting all over the place. Dark talk 00:15, 11 February 2019 (UTC)


Seems to me that everybody here misses the point of the comic, which is not the problems that hidden leftover formatting could cause for later text. The joke here is that Randall is about to write something where he really means that NOT. But then he regrets it, as he is afraid that the reader of his text/message would take offense at having this "not" shouted out in bold! So he reverts the bold, but because he misses the space, he has left proof that he actually did mean NOT, and this can now be found out by the receiver anyway, who might then take offense anyway, or take offense that Randall felt he had to delete the bold, as if the receiver could not handle this (of course, if he took offense at this, Randall would have proved his point, but nevertheless he tries to avoid this). All this is mentioned now at the very end of a long list of indifferent problems such a bold space could create. I will move this up to the top now, as the main explanation. --Kynde (talk) 10:06, 13 February 2019 (UTC)

I found (and find) the typography in this comic troubling, because while it is clearly a proportionally spaced font ("l" is 5px wide, "w" is 23px), the boldfaced and roman "not"s are the same size (49px wide). In a normal proportionally spaced situation, the boldfaced letters would be wider. JohnHawkinson (talk) 03:23, 23 February 2019 (UTC)

In an edit last week I removed the claims that "Randall bolds text via clicking" and that it "could indicate that Randall is not familiar with using word processors." 172.68.144.145 just reverted my removal, and I wanted to explain here why '.145 is wrong, in a little more space than the edit summary allows. I said originally, "An iconbutton is used for bold in comics for illustrative purposes, because you can't see the keyboard. It does not reflect the author-artist's knowlege." That is, we cannot draw conclusions about Randall's knowledge based on the fact that he didn't illustrate in this comic using a keyboard.
'.145 asks, perhaps rhetorically, "Then why not just write "Ctrl+B"? You can't see the mouse either, but you know what "click" and "select" are referring to."
First of all, it doesn't matter. The comic could also have illustrated use of a menu, but that wouldn't tell us anything about Randall's knowledge of the iconbutton or the keyboard shortcut. Without any information about this, it's not possible to make reasonable inferences about this, and so the explanation shouldn't even go there. Secondly, there are good reasons why an iconbutton makes more sense (not that I'm required to supply them); because keyboard shortcuts are not as discoverable as iconbuttons or menus (and menus take a lot of space that make them hard in a comic of small compact multiples like this one) that means more people are familiar with the menu or button than the keyboard shortcut, and indeed those who know the keyboard shortcut are generally a subset of those who know another method; and further still, "Ctrl+B" is not platform-independent (e.g. Mac users need Cmd+B) or software-independent (InDesign users need Cmd+Shift+B). Thirdly, you can indeed see the mouse pointer, so I'm not sure what '.145 is trying to suggest. And finally, it's utterly ridiculous and kind of offensive to suggest (without any real basis) that Randall doesn't know how to use a word processor. That a person chooses to use one method, even if it's not the most efficient method, doesn't mean they are "not familiar with using word processors." We don't even know what Randall's UI preferences are here, but even if we did that wouldn't be enough to suggest a lack of familiarity rather than a personal preference. The text from this edit is not encyclopedic and should stay out. JohnHawkinson (talk) 14:48, 4 March 2019 (UTC)

In LibreOffice Writer on Linux if I select a word with double-click it doesn't include the space, but if I select it with the keyboard using Ctrl+Shift+RightArrow it does include the space. In the comic it looks like the selection was made with the mouse, but it's not explicit. 172.68.189.193 00:15, 11 July 2019 (UTC)

I do this. 162.158.146.41 (talk) 02:05, 25 August 2022 (please sign your comments with ~~~~)

I work with some autogenerated documents that are over 90% formatting (by file size), despite being pretty uniform to the eye. Still trying to figure out how to edit the autogenerator to not do that. 172.68.245.24 10:06, 1 August 2024 (UTC)