Latest revision as of 16:39, 29 February 2024

2899: Goodhart's Law
Title text: [later] I'm pleased to report we're now identifying and replacing hundreds of outdated metrics per hour.

Explanation

In this comic, White Hat suggests creating a meta-metric, "number-of-metrics-that-have-become-targets," and making it a target.

First, Cueball introduces and defines Goodhart's Law, which is the observation that when a metric — a measure of performance — becomes a goal, efforts will be unhelpfully directed to improving that metric at the expense of systemic objectives.

For example, imagine a scenario in which a car dealership is looking to grow profits, and its managers decide to focus on increasing a component metric of profit: how many cars it sells. So they offer a bonus to their salespeople to sell more cars. But then the salespeople offer deep discounts to rack up sales, rendering the car sales unprofitable. This example shows how a metric (cars sold) can become the target, replacing the real target, profit growth, if individual incentives are not properly managed.
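The dealership scenario can be sketched numerically. The sketch below is a hypothetical illustration with invented numbers (not anything from the comic): paying salespeople per car sold rewards the deep discounting that maximizes the metric while destroying the real target, profit.

```python
# Hypothetical illustration of Goodhart's Law at a car dealership.
# All prices, volumes, and the volume-response factor are invented.

def outcome(discount_rate, base_sales=100, cost=20000, list_price=25000):
    """More discounting sells more cars but erodes the per-car margin."""
    units_sold = int(base_sales * (1 + 4 * discount_rate))  # discounts boost volume
    sale_price = list_price * (1 - discount_rate)
    profit = units_sold * (sale_price - cost)
    return units_sold, profit

# Salespeople rewarded on profit keep discounts modest.
units_a, profit_a = outcome(discount_rate=0.05)   # 120 cars, $450,000 profit

# Salespeople rewarded per car sold discount heavily to chase the metric.
units_b, profit_b = outcome(discount_rate=0.25)   # 200 cars, -$250,000 profit
```

The metric (cars sold) improves by two thirds while the underlying goal (profit) goes negative, which is exactly the substitution Cueball describes.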

Hearing about Goodhart's Law, White Hat suggests eliminating metrics that have become targets.

White Hat's suggestion could be a good or a bad idea. It all depends on how the bonus incentive is awarded:

  • A well-designed implementation would award bonuses only for finding metrics which truly aren't serving their purpose, so the organization's managers could fix the measurement issues (assuming the fix isn't worse than the status quo), and would employ sufficient management oversight to discourage trivial submissions. If submissions are in good faith, bonuses are awarded only for approved submissions, and the identifications result in real improvements, the organization will likely be better off.
  • A poorly-designed implementation would offer a bonus to every identification, regardless of quality. This would incentivize the identification of even quite useful metrics — and perhaps even the creation of new metrics-as-targets for the sole purpose of then removing them and collecting the bounty.

The title text imagines this poorly-designed implementation, leading to the creation of a new metric (metric changes per hour) and the organization identifying — and replacing — hundreds of metrics per hour, crowding out actual focus on the organization's true goals. It's the ultimate example of "change for change's sake."

Part of the joke is that White Hat's original suggestion — itself a new metric-turned-target that should be replaced — ironically seems to survive the replacement of hundreds of other metrics.

This comic illustrates that the thoughtless combination of Goodhart's Law and poorly designed incentives can have ruinous results for an organization.

The proper usage of organizational metrics and incentives is the focus of managerial accounting, a field within organizational management.

Discussion of the promises and perils of operational measurement

While there is a temptation to game any metric, measurement is the main objective way of describing the success of an activity and assessing the effect of changes. "Data-driven" or "evidence-based" approaches are used to drive measurable improvements in various areas of society.

Discussions of Goodhart's Law have noted (see https://commoncog.com/goodharts-law-not-useful/) that people may respond to a metric by either (1) improving the system, (2) distorting the system (examples below), or (3) distorting the data (e.g., governments publishing false or cherry-picked economic data). Channeling energy toward improvement requires an organization to make (1) more appealing (through flexibility and culture) and (2) and (3) less so (through transparency, culture, and reduced pressure to meet unrealistic goals). Figuring out how to do that is a slow and thoughtful process, unlike White Hat's knee-jerk jump to a new metric.

Additional examples of Goodhart's Law

  • The classic example of Goodhart's Law is the Cobra Effect: anecdotally, the British colonial government in India paid bounties for dead cobras as a pest-control effort. People quickly realized that more cobras meant more bounties to collect, and began actively breeding cobras.
  • School test scores are intended as a metric for how well a school is teaching its students. When that becomes an incentivized target, schools are forced to design their curriculum around the exams, which can create a more rigid system which fails to engage students and teachers. In extreme cases, this can motivate decisions to remove underperforming students from school districts, or encourage teachers to allow or even facilitate cheating.
  • A hospital measures inpatient Length of Stay because shorter stays save money and free up beds for other patients. But this metric, on its own, may encourage doctors to discharge patients too soon. This not only puts patients at risk, but can also result in costly re-admissions.
  • A call center measures the number of calls handled per hour as a measure of worker productivity. This can drive workers to rush through calls, terminating them as quickly as possible, which can lead to short, frustrating interactions.
  • The hypothetical Paperclip Maximizer concept demonstrates how having a seemingly benign metric as a goal might still result in almost unlimited adverse effects, if unchecked.
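The call-center example above lends itself to the same kind of back-of-the-envelope sketch. All figures here are invented for illustration: rushing calls multiplies the measured throughput while actually resolving fewer problems.

```python
# Hypothetical sketch of the call-center example: rewarding calls handled
# per hour can be gamed by cutting calls short. All numbers are invented.

def shift_stats(avg_call_minutes, resolution_rate):
    """Calls handled in an 8-hour shift, and how many callers got real help."""
    calls_handled = int(8 * 60 / avg_call_minutes)
    resolved = int(calls_handled * resolution_rate)
    return calls_handled, resolved

# An agent who works each problem to completion.
thorough = shift_stats(avg_call_minutes=12, resolution_rate=0.9)  # (40, 36)

# An agent who rushes callers off the line to maximize the metric.
rushed = shift_stats(avg_call_minutes=3, resolution_rate=0.2)     # (160, 32)
```

The target (calls per hour) quadruples even though fewer callers end the shift with their problem solved — the metric has ceased to measure productivity.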

Transcript

[Cueball and White Hat are standing and talking, White Hat with hand on his chin.]
Cueball: When a metric becomes a target, it ceases to be a good metric.
White Hat: Sounds bad. Let's offer a bonus to anyone who identifies a metric that has become a target.



Discussion

I don't think there's anything else that could be included in the transcript, so i'm deleting the incomplete tag. if anyone has an idea to make it better, just add it. i know it seems too soon, but there's really nothing else to the comic. New editor (talk) 22:17, 26 February 2024 (UTC)

This happens all the time. For instance, a call center whose metric-turned-target is number of calls handled per hour (which sounds good in theory) is incentivised to hang up on callers, who then call back - increasing their "performance" as measured by the target, as it both decreases the time each call takes (thus making time for more calls) and increases the volume of incoming calls. Of course, the side effect is ticked-off customers heading to competitors instead. (Which often doesn't affect the call center as it's a third party.) If the metric-turned-target is getting a good survey response at the end of the call, treating the customer so badly they hang up (and thus don't take the survey) for any call that is going poorly becomes a viable way of improving the measurement of their performance. Creating good targets is HARD. 172.70.43.157 22:38, 26 February 2024 (UTC)

Moderator (talk) 23:12, 26 February 2024 (UTC) Moderator (talk) 23:12, 26 February 2024 (UTC) Moderator (talk) 23:12, 26 February 2024 (UTC)

The above, by 'Moderator' appears to be a meta-joke. i.e. trying to enhance 'times signed', which of course isn't even a useful measure, at the expense of bringing anything useful to the situation. It was even done in just one edit, so didn't even increase the standard 'contributions' measure that an actual target-hitter might try to hit.
Either that or they messed up/have other machinations in mind. But I just thought I'd 'dissect the frog' for future readers. 172.70.91.165 04:19, 27 February 2024 (UTC)
Moderator (talk) 12:30, 28 February 2024 (UTC) Moderator (talk) 12:30, 28 February 2024 (UTC) Moderator (talk) 12:30, 28 February 2024 (UTC)
Yep, got it. 172.70.85.27 12:49, 28 February 2024 (UTC)
are you sure? Moderator (talk) 23:19, 28 February 2024 (UTC) Moderator (talk) 23:19, 28 February 2024 (UTC) Moderator (talk) 23:19, 28 February 2024 (UTC)

The main problem with metrics is that there can be too many (everything is a metric, you're chasing targets even if just trying to be the most average and not to be an outlier) or there are too few (everything is 'boiled down' to a single figure of 'success', with no nuance available to work out why it's marked as "good" rather than "excellent"). Or both at the same time! That said, I think changing a target-system to be a less-worse-target-system is often the worst of all worlds, as every meaningful measure is changed, and/or the means to measure them are changed, all of this impinging upon the actual job of work that was always supposed to be done, regardless... 172.70.91.165 04:19, 27 February 2024 (UTC)

Probably the worst metric/target is the perpetual growth delusion. Your office furniture sales figures are down fifteen percent from this month last year. Nevermind that they were up three thousand percent last year because your biggest customer had to replace the furniture lost in a fire. 172.71.26.16 06:33, 27 February 2024 (UTC)

Feels like this comic is really about how incentives are difficult. A metric only becomes a target if there's an incentive, and that's only a problem if the incentive is poorly conceived. For anyone who hasn't spent a lot of time thinking about metrics and reads this comic and thinks that metrics are the crux of the issue, they're not; incentive design is. Laser813 (talk) 11:53, 27 February 2024 (UTC)

Yes, and no. Metrics in and of themselves have a psychological power and tend to direct attention, and therefore action, to the things being measured. So good incentive design (and other psychological framing) is then needed to counteract that biasing effect.172.70.90.28 14:08, 27 February 2024 (UTC)
The issue comes in the moment the incentive is to "improve the metric" rather than "improve the thing the metric is intended to indicate." For example, there's the Hot Waitress Economic Index, whereby the sexier the average waitress, the worse the economy is doing (as attractive women usually have no problem getting jobs in sales when the economy is doing well). If someone comes up with the brilliant idea of fixing the economy by recruiting more unattractive waitresses, the metric no longer measures the thing it is supposed to at all. 172.69.247.49 18:22, 27 February 2024 (UTC)
Exactly. It can be incredibly mundane things - a store I worked in encouraged the inclusion of accessories with main purchases, obviously, but also used to discourage us from selling accessories if customers remembered as they were leaving, after the main sale. If we "allowed" it, the Average Transaction Value and Items Per Basket indicators would both be down. Same stuff being sold, but if it was sold separately from the thing it supplemented, that was a bad thing.
It can also be much bigger, more important things - good figures for DEI targets doesn't necessarily mean attitudes towards people from traditionally disadvantaged demographics have improved, it just means firms have been told to employ more of them. If somebody is given a leg up but you only measure how many are sitting up high...how do you tell if the need for a leg up is lessening? And are you really combating the wider need for legups to be given if you keep giving them to ensure targets are met? What's the incentive for improving the big picture if the obsession is with improving a few small details? Yorkshire Pudding (talk) 22:47, 27 February 2024 (UTC)
Ugh, yes. Some companies meet their DEI targets by interviewing people based on DEI criteria instead of looking at skills and experience. I was once denied an interview that way - being a white male candidate, the hiring manager explicitly told me I couldn't be considered until all the "diversity candidates" had been rejected. 172.70.42.150 00:05, 29 February 2024 (UTC)

In early days of computer programming managers tried to assess the performance of programmers in a way that they would assess the performance of assembly line workers and decided to use the metric of "lines of code per day". The results were laughable. There was also the, possibly apocryphal, story from the old Soviet Union where the government rewarded automobile plants for meeting certain quotas for number of cars produced, and rewarded scrap metal facilities for meeting certain quotas for number of cars demolished, and it wasn't long before the facilities figured out that delivering the cars of dubious value straight to junk yards was the most efficient and rewarding way to operate. Rtanenbaum (talk) 21:08, 27 February 2024 (UTC)

Unfortunately, some languages including Java still have built-in support for such measurements. -- Hkmaly (talk) 15:26, 29 February 2024 (UTC)

Need to remove some text

The whole section from the headline "Discussion of the promises and perils of operational measurement" up to the transcript should be eliminated. This page is supposed to be an explanation of a comic, not an exposition on operational management! If such an exposition is really needed (IMO it's not, but there's room for disagreement), please just put in a link to one, don't copy it here. DKMell (talk) 04:09, 29 February 2024 (UTC)

Disagree. It's helpful. Besides, what about the metrics of explanation length? -- Hkmaly (talk) 15:24, 29 February 2024 (UTC)
Also disagree, though if you think it's long-winded, I endorse editing for brevity. Laser813 (talk) 15:48, 29 February 2024 (UTC)