Editing 2739: Data Quality

{{comic
| number    = 2739
| date      = February 17, 2023
| title     = Data Quality
| image     = data_quality_2x.png
| imagesize = 671x211px
| noexpand  = true
| titletext = [exclamation about how cute your cat is] -> [last 4 digits of your cat's chip ID] -> [your cat's full chip ID] -> [a drawing of your cat] -> [photo of your cat] -> [clone of your cat] -> [your actual cat] -> [my better cat]
}}

==Explanation==
{{incomplete|Created by a SUPERIOR FELINE. Do NOT delete this tag too soon.}}
<!-- Specifically "No Idea If There's A Character Limit LMAO": please refrain from removing any more Incomplete tags by yourself and so quickly, and please check your Talk page! And please remove this comment once you've read it. :) -->

Digital data are transferred in bits, and {{w|data loss}} is the process by which some of these bits are lost or altered during data transport. Data can also be compressed to make transmission and/or storage more efficient. Some {{w|compression algorithms}} discard some data to improve the compression, which is known as lossy compression (this can be acceptable in audio or visual data, since the difference may be hard for humans to perceive), while lossless compression allows the original to be reconstructed exactly.
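
As a rough illustration of the difference, the following Python sketch compresses some bytes losslessly with the standard <code>zlib</code> module and, separately, applies a toy 'lossy' step that simply throws away the low four bits of each value. The sample text and sample values are made up for this example, and real lossy formats are far more sophisticated than this.

```python
import zlib

# Hypothetical sample data: a short, repetitive byte string.
original = b"The quick brown fox jumps over the lazy dog. " * 20

# Lossless: zlib (DEFLATE) compresses, and decompression restores every byte.
packed = zlib.compress(original)
restored = zlib.decompress(packed)
lossless_ok = restored == original

# Lossy (toy example): keep only the top 4 bits of each 8-bit sample.
samples = [0, 17, 100, 129, 200, 255]
quantized = [s >> 4 for s in samples]        # 4 bits per sample instead of 8
reconstructed = [q << 4 for q in quantized]  # the discarded low bits cannot be recovered

print(len(original), len(packed), lossless_ok)
print(samples, reconstructed)
```

The lossless round trip returns exactly the original bytes, whereas the quantized samples only come back approximately (for instance, 17 comes back as 16).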

This comic shows a chart in the form of a line, increasing in quality from very lossy to most lossless. At the extremes, it goes from having so little of the target data that it is effectively meaningless to having significant extra data included (eventually making the original an unnecessary distraction). However, the highest quality, "better data", uses a different sense of the term "quality", referring more to the general excellence of the data than to how accurately it represents the original.

The title text uses your cat as an example of this range of losses (or, in the case of the latter reaches of the graph, gains) in the data. This is possibly a reference to [https://www.goodreads.com/quotes/8157292-the-best-material-model-of-a-cat-is-another-or Norbert Wiener]'s quote, "The best material model of a cat is another, or preferably the same, cat." The most lossy is an exclamation about how cute your cat is, which is ephemeral and obviously carries very little significance in terms of actually providing specific, transferable information about your cat. The example then progresses to your cat's chip ID; presumably your cat has been microchipped, and either the last four digits (commonly used with sensitive information as an identifier that does not reveal the full number) or the entire chip ID provides a still-uninformative yet slightly improved way of identifying your cat. A drawing of your cat and a photo of your cat would portray the cat reasonably well, while a clone of your cat and (of course) your actual cat would be the best way of gaining data about your cat. However, as in the actual comic, the final, most lossless (in this case, with the most gain) form of data transfer has nothing to do with your cat, but is simply Randall's better cat. Randall apparently presents this as the pinnacle of cat data.

=== Details ===
{| class="wikitable" 
|-
! Item
! Explanation
|-
| {{w|Bloom filter}}
| A Bloom filter is a probabilistic data structure that can efficiently say whether an element is ''probably'' part of a set, while it can say "element is not in set" with 100% accuracy. It stores only a compact pattern of bits, not the elements themselves, so if a Bloom filter were used to represent the contents of a book, the book could not be read back from it; at best one could test guesses against it, and some wrong guesses would be accepted.
|-
| {{w|Hash table}}
| A hash table is a data structure that uses a hash function to turn a key into an index, which allows data to be found very fast. Randall probably means hashing the contents of entire books. Calculating a hash value for an entire book means that there is (most probably) a unique relationship between the book and a hash value, e.g. "58b8893b2a116d4966f31236eb2c77c4172d00e9" (40 hexadecimal digits, i.e. 160 bits). This means the book will always yield this exact hash value, though it is practically impossible to reconstruct the book's content from a hash value. The hash is highly compact, but meaningless on its own: an average book contains several million bits, yet a SHA-256 hash (from the SHA-2 family) has only 256 bits.
|-
| {{w|JPEG|JPG}}, {{w|GIF}}, {{w|MPEG-1|MPEG}}
| Image and video formats that are considered 'lossy'. The JPG (or "JPEG") format and the MPEG {{w|MPEG-2|group}} {{w|Advanced Video Coding|of}} formats typically use a range of data-compression methods that save space by selectively fudging (thus losing) what details they can of the image (and audio, where appropriate), to make disproportionate gains in compression. They are best used for real-world images (and films), where real-world 'noise' can afford to be replaced by a more compressible version without too much obvious change.
GIF compression is not 'lossy' in the same way, i.e. whatever it is asked to encode can be faithfully decoded, but Randall may consider its limitations (each image can use a palette of at most 256 colors, albeit that these can be chosen from anywhere across the whole 16,777,216-color "True color" range, plus transparency) to be a form of loss, as conversion from a more sophisticated format (e.g. PNG, below) could lose many of the subtle shades of the original and produce an inferior image. For this reason, GIF became a format best left to render diagrams and other computer-generated imagery with swathes of identical pixels and mostly sharp edges (and to utilise the optional transparent mask). Alternatively, he may just have included it as a joke/nerd-snipe.
|-
| {{w|PNG}}, {{w|ZIP (file format)|ZIP}}, {{w|TIFF}}, {{w|WAV}}
| A series of formats that store data without loss. PNG and TIFF are image formats that are suitable for photos without resorting to reduced accuracy in order to assist compression. WAV is an audio format (typically storing uncompressed samples) that also does not arbitrarily sacrifice 'unnecessary' details, unlike the more recently developed {{w|MP3|MPEG Audio Layer III}}, which has become the de facto consumer audio format for many.
ZIP is a general-purpose archive format (usually using the DEFLATE compression algorithm) that can be used to store any other digital files for exact decompression later on, although files already compressed in some way are unlikely to compress significantly more.
|-
| Parity bits for error detection
| In the number 135, the sum of the digits is 9. So the number 135 could be written as "135-9", with the extra digit acting as a check value in the same way that parity bits do for binary data. If the number was altered, the check could tell you so (in some cases), although the error might equally be in the check digit itself. But a change from "135" to "153" could not be detected that way, as the digit sum is still 9. There are more reliable means to detect errors, such as CRC-32 and cryptographic hashes like MD5 (now considered broken for security purposes) and the more modern {{w|Secure Hash Algorithm|SHA}} family.
|-
| Parity bits for error correction
| There are ways to restore the original data using the additional data. One method is to 'overlap' multiple parity checks, each covering a different subset of the bits, such that any small enough corruption of the data (including of the parity bits themselves) can be located and reversed to recover the correct original value. One of the first such methods is {{w|Hamming(7,4)}}, invented around 1950.
|}
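
The Bloom filter behaviour described in the table can be sketched in a few lines of Python. This is a minimal toy version, not a production design: the class name, the filter size (1024 bits), the number of hash positions (3) and the use of salted SHA-256 to derive positions are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a few bit positions per item, derived from SHA-256."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item):
        # Salt the item with the hash index to get several different positions.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely never added"; True only means "probably added".
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical usage: remember a few words without storing the words themselves.
words = BloomFilter()
for word in ["cat", "chip", "clone"]:
    words.add(word)

print(words.might_contain("cat"))  # True: every added item is always reported
```

Only the bit pattern is kept, so the added words cannot be listed back out of the filter; they can only be tested one guess at a time, and an item that was never added may occasionally be reported as present.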
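
The two parity rows of the table can be illustrated with a short Python sketch, offered as an example rather than a reference implementation. The function names are invented here; the digit-sum check mirrors the "135" example, and the encoder and decoder follow the usual textbook layout of Hamming(7,4), with parity bits at positions 1, 2 and 4.

```python
def digit_sum_check(number):
    """Toy check value from the table: the sum of the decimal digits."""
    return sum(int(d) for d in str(number))


def hamming74_encode(d1, d2, d3, d4):
    """Return 7 bits laid out as p1 p2 d1 p3 d2 d3 d4 (positions 1 to 7)."""
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]    # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]    # parity over positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]    # parity over positions 4, 5, 6, 7
    error_pos = s1 + 2 * s2 + 4 * s3  # 0 means no single-bit error was found
    if error_pos:
        c[error_pos - 1] ^= 1         # flip the offending bit back
    return [c[2], c[4], c[5], c[6]]


# Detection only: swapping two digits leaves the digit sum unchanged.
print(digit_sum_check(135), digit_sum_check(153))  # 9 9

# Correction: one bit flipped in transit is located and repaired.
codeword = hamming74_encode(1, 0, 1, 1)   # [0, 1, 1, 0, 0, 1, 1]
corrupted = list(codeword)
corrupted[4] ^= 1                         # flip the bit at position 5
recovered = hamming74_decode(corrupted)   # [1, 0, 1, 1]
print(codeword, corrupted, recovered)
```

The three overlapping parity checks together spell out the position of a single flipped bit, which is what allows correction rather than mere detection. If two bits are flipped, this simple decoder will 'correct' the wrong position.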

==Transcript==
:[A line chart is shown with eight unevenly-spaced ticks, each one with a label beneath the line. Above the middle of the line there is a dotted vertical line with a word on either side of this divider. Above the chart there is a big caption with an arrow beneath it pointing right.]
:<big>Data Quality</big>
:Lossy ┊ Lossless

:[Labels to the left of the dotted line from left to right:]
:Someone who once saw the data describing it at a party
:Bloom filter
:Hash table
:JPEG, GIF, MPEG

:[Labels to the right of the dotted line from left to right:]
:PNG, ZIP, TIFF, WAV, Raw data
:Raw data + parity bits for error detection
:Raw data + parity bits for error ''correction''
:Better data

{{comic discussion}}

[[Category:Charts]]
[[Category:Cats]]
@@ Line 10: / Line 10: @@
 ==Explanation==
+{{incomplete|Created by a SUPERIOR FELINE. Do NOT delete this tag too soon.}}
 <!-- Specifically "No Idea If There's A Character Limit LMAO": please refrain from removing any more Incomplete tags by yourself and so quickly, and please check your Talk page! And please remove this comment once you've read it. :) -->
-Digital data can be compressed to make transmission and/or storage more efficient; some {{w|compression algorithms}} discard some information to improve the compression, which is known as lossy compression, since some of the information is lost (this can be acceptable in audio or visual data, since the difference may be hard for humans to perceive).
+Digital data are transferred in bits, and {{w|data loss}} is the process by which some of these bits are lost or altered during data transport. Data can also be compressed to make transmission and/or storage more efficient; some {{w|compression algorithms}} discard some data to improve the compression (this can be acceptable in audio or visual data, since the difference may be hard for humans to perceive).
-This comic shows a chart in the form of a line, increasing quality from very lossy to most lossless. This means that it goes, at the extremes, from having so little information as to make it effectively meaningless, to having significant extra information included (eventually making the original actually an unnecessary distraction). Some of this extra information mitigates the risk of another sense of 'loss' in data - digital data are transferred in bits, and {{w|data loss}} is the process by which some of these bits are lost or altered during data transport. However the highest quality, "better data", is using a different sense of the term "quality", referring more to the general excellence of the data than how accurately it represents the original.
+This comic shows a chart in the form of a line, increasing quality from very lossy to most lossless. This means that it goes, at the extremes, from having so little of the target data (making it effectively meaningless) to having significant extra data included (eventually making the original actually an unnecessary distraction). However the highest quality, "better data", is using a different sense of the term "quality", referring more to the general excellence of the data than how accurately it represents the original.
-The title text uses your cat as an example of this range of losses (or, in the case of the latter reaches of the graph, gains) in the information. This is possibly a reference to [https://www.goodreads.com/quotes/8157292-the-best-material-model-of-a-cat-is-another-or Norbert Wiener]'s quote, "The best material model of a cat is another, or preferably the same, cat." The most lossy is an exclamation about how cute your cat is, which is ephemeral and obviously carries very little significance in terms of actually providing specific, transferable information about your cat. The example then progresses into your cat's chip ID; presumably your cat has been microchipped, and between the last four digits (commonly used in sensitive information as an identifier without revealing the full number) or the entire chip ID, provides a still-uninformative yet slightly improved way of identifying your cat. A drawing of your cat and a photo of your cat would portray the cat reasonably well, while a clone of your cat and (of course) your actual cat would be the best way of gaining information about your cat. However, as in the actual comic, the final, most lossless (in this case, with the most gain) form of data transfer has nothing to do with your cat, but is simply Randall's better cat. This is apparently made out by Randall to be the pinnacle of cat data.
+The title text uses your cat as an example of this range of losses (or, in the case of the latter reaches of the graph, gains) in the data. This is possibly a reference to [https://www.goodreads.com/quotes/8157292-the-best-material-model-of-a-cat-is-another-or Norbert Wiener]'s quote, "The best material model of a cat is another, or preferably the same, cat." The most lossy is an exclamation about how cute your cat is, which is ephemeral and obviously carries very little significance in terms of actually providing specific, transferrable information about your cat. The example then progresses into your cat's chip ID; presumably your cat has been microchipped, and between the last four digits (commonly used in sensitive information as an identifier without revealing the full number) or the entire chip ID, provides a still-uninformative yet slightly improved way of identifying your cat. A drawing of your cat and a photo of your cat would portray the cat reasonably well, while a clone of your cat and (of course) your actual cat would be the best way of gaining data about your cat. However, as in the actual comic, the final, most lossless (in this case, with the most gain) form of data transfer has nothing to do with your cat, but is simply Randall's better cat. This is apparently made out by Randall to be the pinnacle of cat data.
 === Details ===
@@ Line 22: / Line 23: @@
 |-
 ! Item
-! Title Text
 ! Explanation
-|-
-| Someone who once saw the data describing it at a party
-| exclamation about how cute your cat is
-| This is referring to how unreliable and inaccurate it is to get information verbally second-hand, as humans are naturally terrible at maintaining accuracy when passing on information received. This is the basic premise behind {{w|Chinese whispers|the Telephone Game}}. People naturally and instinctively mentally summarize information received in the way they understand, often in their own words instead of what they literally heard or read.
 |-
 | {{w|Bloom filter}}
-| last 4 digits of your cat's chip ID
+| A Bloom filter is a probabilistic data structure that can efficiently say whether an element is ''probably'' part of the dataset, while it can say "element is not in set" with 100% accuracy. If a Bloom filter is used to compress the contents of a book, the Bloom filter can re-tell a similar story - just by guessing.
-| A Bloom filter is a probabilistic data structure that can efficiently say whether an element is ''probably'' part of the dataset, while it can say "element is not in set" with 100% accuracy. If a Bloom filter is used to represent the contents of a book, reference to the Bloom filter could perhaps reconstruct everything, just by guessing, but in a highly inefficient and potentially inaccurate way. A bloom-filter is like a the last four digits of the cat's ID in that while you can know for sure a cat isn't your cat if it's last four digits don't match, you can't know for sure that it is yours if they do.
 |-
 | {{w|Hash table}}
-| your cat's full chip ID
+| A hash table allows you to find data very fast. Randall probably means hashing the contents of entire books. Calculating a hash value for an entire book means that there is (most probably) a unique relationship between the book and a hash value - e.g. "58b8893b2a116d4966f31236eb2c77c4172d00e9". This means the book will yield this exact hash value, though it's impossible to reconstruct the book's content from a hash vaue. It is a highly efficient, but is meaningless: An average book contains several millions of bits, yet the SHA-2 hash has only 256 bits.
-| A hash table allows you to find data very fast. Randall probably means hashing the contents of entire books. Calculating a hash value for an entire book means that there is (most probably) a unique relationship between the book and a hash value - e.g. "58b8893b172d00e9". This means this exact version of the book will yield this exact hash value, though it's practically impossible to reconstruct the book's potential content from a hash value. It is a method of checking that a copy is the same as the original, but is meaningless on its own and has the possibility of being wrong. An average book contains several millions of bits, yet the SHA-2 hash has only 256 bits, so there are theoretically many (mostly nonsensical, but not necessarily) 'wrong' versions that might look correct.
 |-
 | {{w|JPEG|JPG}}, {{w|GIF}}, {{w|MPEG-1|MPEG}}
-| a drawing of your cat
+| Image and video formats that are considered 'lossy'. JPG (or "JPEG") format and the MPEG {{w|MPEG-2|group}} {{w|Advanced Video Coding|of}} formats typically use a range of data-compression methods that save space by selectively fudging (thus losing) what details it can of the image (and audio, where appropriate), to make disproportionate gains in compression; best used for real world images (and films) where real-world 'noise' can afford to be replaced by a more compressible vesion, without too much obvious change.
-| Image and video formats that are considered 'lossy'. JPG (or "JPEG") format and the MPEG {{w|MPEG-2|group}} {{w|Advanced Video Coding|of}} formats typically use a range of data-compression methods that save space by selectively fudging (thus losing) what details it can of the image (and audio, where appropriate), to make disproportionate gains in compression; best used for real world images (and films) where real-world 'noise' can afford to be replaced by a more compressible version, without too much obvious change.
+GIF compression is not 'lossy' in the same way, i.e. whatever it is asked to encode can be faithfully decoded, but Randall may consider its limitations (it can only write images of 256 unique hues, albeit that these can come from anywhere across the whole 65,536 "True color" range, plus transparency) to be a form of loss, as conversion from a more sophisticated format (e.g. PNG, below) could lose many of the subtle shades of the original and produce an inferior image. For this reason, GIF format became one best left to render diagrams and other computer-generated imagery with swathes of identical pixels and mostly sharp edges (and to utilise the optional transparent mask). Alternatively, he may just have included it as a joke/nerd-snipe.
-GIF compression is not 'lossy' in the same way, i.e. whatever it is asked to encode can be faithfully decoded, but Randall may consider its limitations (it can only write images of 256 unique hues, albeit that these can come from anywhere across the whole 65,536 "True color" range, plus transparency) to be a form of loss, as conversion from a more sophisticated format (e.g. PNG, below) could lose many of the subtle shades of the original and produce an inferior image. For this reason, GIF format becomes one best left to render diagrams and other computer-generated imagery with swathes of identical pixels and mostly sharp edges (and to utilize the optional transparent mask), for which JPEG compression will create prominant image artefacts. Alternatively, he may just have included it as a joke/nerd-snipe.
 |-
-| {{w|PNG}}, {{w|ZIP (file format)|ZIP}}, {{w|TIFF}}, {{w|WAV}}, raw data
+| {{w|PNG}}, {{w|ZIP (file format)|ZIP}}, {{w|TIFF}}, {{w|WAV}}
-| photo of your cat
+| A series of formats using lossless compression. PNG and TIFF are image formats, that are suitable for photos but without resorting to reduced accuracy in order to assist compression. WAV is an audio format that also does not arbitrarily sacrifice 'unnecessary' details, unlike the more recently developed {{w|MP3|MPEG Audio Layer III}} which has become the defacto consumer audio format for many.
-| A series of formats using lossless compression. PNG and TIFF are image formats that are suitable for photos, but without (necessarily) resorting to reduced accuracy in order to assist compression. WAV is an audio format that also does not arbitrarily sacrifice 'unnecessary' details, unlike the more recently developed {{w|MP3|MPEG Audio Layer III}} which has become the de-facto consumer audio format for many.
+ZIP is a generic compression algorithm(/format) that can be used to store any other digital file, for exact decompression later on, although any file(s) already compressed in some way are not likely to compress significantly more.
-ZIP is a generic compression algorithm (and the name of the format it creates) that can be used to store any other digital files. Anything put within a ZIP file can be exactly decompressed into the original state later on, although any such file already compressed in some way (such as any of the image formats mentioned in this comic, or other ZIPs) are unlikely to recompress significantly more.
-|-
-| Raw data + parity bits for error detection
-| clone of your cat
-| In the number 135, the sum of its digits is 9. So the number 135 could be written as "1359", for example, slightly increasing the amount of data that needs to be sent. But with the slight advantage that, if the number was tampered with, the parity bits may be able tell you that an error has occured. (Possibly that the parity itself was the digit that was miswritten.) But a change from "1359" to "1539" could not be detected, in this method, when extracting the parity digit and using this to presume that the first three digits are indeed 'correct'.
-There are more reliable means to detect errors, such as CRC-32 (now considered obsolete), MD5 and the much more modern {{w|Secure Hash Algorithm|SHA}}. Such values were alluded to in the Hash Table section. But here they are sent ''alongside'' the data, slightly increasing the amount of data transmitted/stored (in order to establish its accuracy), rather than instead of it and vastly decreasing the amount of 'necessary' data (but leaving the virtually impossible task of performing a correct reconstruction).
-However it is done, if the check indicates a problem then you can only seek a new copy (of the data, and/or the parity or hash), hoping that the problems encountered can be resolved.
 |-
-| Raw data + parity bits for error ''correction''
+| Parity bits for error detection
-| your actual cat
+| In the number 135, the sum of digits is 9. So, the number 135 could be written as "135-9". If the number was tampered with, the parity bits could tell you so (in some cases), or possibly that the parity itself was the digit that was miswritten. But a change from "135" to "153" could not be detected that way. There are more reliable means to detect errors: The obsolete CRC-32 and MD5, and the much more modern {{w|Secure Hash Algorithm|SHA}}.
-| With extra error-checking, there are ways to immediately restore the original data with the given additional data. One method is to 'overlap' multiple error-detection parities such that any small enough corruption of data (including of parity bits themselves) can be reconstructed to the correct original value by cross-comparison between all parity bits and the supposed data. One of the first modern methods developed was {{w|Hamming(7,4)}}, invented around 1950, which was a balanced approach designed to handle the typical error conditions typically encountered at the time and has inspired even contemporary electronic methods of maintaining data integrity. Another practical application of error correction bits would be that present in {{w|QR_code#Error_correction|QR Codes}}, using {{w|Reed–Solomon error correction|Reed–Solomon error correction}}.
 |-
-| Better data
+| Parity bits for error correction
-| my better cat
+| There are ways to restore the original data with the given additional data. One method is to 'overload' with multiple different methods of error-detection parity such that any small enough corruption of data (including of the parity bits themselves) can be reconstructed to the correct original value. One of the first such methods is {{w|Hamming(7,4)}}, invented around 1950.
-| This gives up on the data in question and suggests swapping it for different data entirely. It is no longer about the quality of the transfer of data, but judging the actual data instead. Philosophically, it could be saying that the data or cat are better in some nebulous way, or that it simply is more accurate to what the data is trying to record and represent, in the title text's case saying that Randall's cat more closely represents the essence of "catness."
 |}
 ==Transcript==
-:[A line chart is shown with eight unevenly-spaced ticks each one with a label beneath the line. Above the middle of the line there is a dotted vertical line with a word on either side of this divider. Above the chart there is a big caption with an arrow beneath it pointing right.]
+:[A line chart is shown with eight unevenly-spaced ticks each one with a label beneath the line. Above the middle of the line there is a dotted vertical line with a word on either side of this divider. Above the chart there is a big caption with an arrow pointing right beneath it.]
 :<big>Data Quality</big>
 :Lossy ┊ Lossless