2054: Data Pipeline
==Explanation==
{{incomplete|Please direct all data pipelines to the explanation below and only mention here why it isn't complete. Do NOT delete this tag too soon.}}
In the first panel [[Cueball]] shows [[Ponytail]] and [[White Hat]] a Data Pipeline he has constructed that, as he puts it, <nowiki>'collects and processes all the data we need'</nowiki>. This implies that the three are running some sort of project that requires data processing. Ponytail assumes that this data pipeline is an unstable mess of scripts that will cease to function correctly should any unexpected input be received. Cueball reluctantly admits that this is very likely, though he seems to hope it might not be. Ponytail is nevertheless impressed and starts to say so, but is interrupted by Cueball, who announces that the system has just malfunctioned and collapsed. He states that he can fix it, suggesting that this cycle of patching and collapse could repeat indefinitely, or until all problems have been patched. [[Code Quality|Knowing Cueball's code, though,]] it seems more likely that he can't.
In the title text, Ponytail or White Hat questions how such an important system can run on such a small computer. Cueball makes it worse by saying he uses his phone because of its better connection. While this might keep the pipeline functional, it also makes it far more fragile.
This comic is a logical continuation of the Code Quality series ([[1513: Code Quality]], [[1695: Code Quality 2]] and [[1833: Code Quality 3]]), further highlighting Cueball's coding ineptitude and Ponytail's exasperation with it.
It's quite common for somebody who spends much of their time coding for enjoyment to attempt to automate absolutely everything they do. Whenever a programmer sees a rote task, they think, "why is a human doing this when the time could be spent making a computer do it automatically, forever?" Unfortunately, short of strong artificial intelligence, one of the places this breaks down is in aggregating information from multiple sources.

People tend to publish their data via a variety of different channels, and since they are not programmers and don't share the programmer's value on consistency and machine-processability, it is all in completely different formats. Some data is only available in print, some only as photographs, some only as written reports. A certain kind of nerd will see this situation and become excited, seeing the opportunity to automate something that nobody else thinks is worth the effort. They begin writing scripts that process all the different formats the data comes in, and eventually get the whole thing working! In theory, they can then make a number of mind-numbing data-processing jobs obsolete.

Google put a lot of energy into conquering this challenge on many fronts during the 2000s, making data more processable everywhere, and possibly hastening the advent of strong artificial intelligences, which would thrive on already-digitized information. A notable project was Google Books, for which libraries were scoured for non-digital information that was then painstakingly scanned. Additionally, organizations have been increasingly pressured to offer their information in standardized formats that can all be processed the same way. This continued pressure is yielding more and more results, but because the standards must be implemented by humans who gain little immediately from the process, adherence to them is rarely universal.

The workaround of building many small programs that handle all the quirks is the domain of "scraping": downloading information intended to be presented to a human, running it through software pre-programmed with the patterns to expect, and normalizing and making use of the data.
Anybody who attempts this as a lone individual quickly realizes that as soon as the data source changes in the smallest way, the output becomes garbage. Often it becomes garbage in a way that is laborious to hunt down and understand, and may not even be noticed. For a corporation relying on the results, this would be like a Trojan horse, destroying it from the inside.
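A minimal sketch of the brittleness described above, using only the Python standard library and a hypothetical page layout (the markup and class names are invented for illustration): the scraper pattern-matches one exact HTML format, and a cosmetic redesign of the source silently turns its output into garbage rather than raising an error.

```python
import re

# The scraper is hard-coded to one expected layout:
#   <td class="price">$12.34</td>
PRICE_PATTERN = re.compile(r'<td class="price">\$([0-9.]+)</td>')

def scrape_prices(html: str) -> list[float]:
    """Extract prices from a page, assuming the exact markup above."""
    return [float(m) for m in PRICE_PATTERN.findall(html)]

# Works while the source keeps its format...
page_v1 = '<tr><td class="price">$12.34</td></tr>'
print(scrape_prices(page_v1))  # [12.34]

# ...but after a trivial redesign (class renamed), the pattern no longer
# matches.  The result is an empty list, not an exception, so the
# breakage can easily go unnoticed downstream.
page_v2 = '<tr><td class="item-price">$12.34</td></tr>'
print(scrape_prices(page_v2))  # []
```

The failure mode worth noting is that nothing crashes: the pipeline keeps running and quietly produces wrong (empty) data, which is exactly why such breakage is laborious to hunt down.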
==Transcript==
{{incomplete transcript|Do NOT delete this tag too soon.}}
:[Cueball is standing with an open laptop, showing it to Ponytail and White Hat.]
:Cueball: Check it out - I made a fully automated data pipeline that collects and processes all the information we need.