
{{comic
| number    = 2435
| date      = March 10, 2021
| title     = Geothmetic Meandian
| image     = geothmetic_meandian.png
| titletext = Pythagorean means are nice and all, but throwing the median in the pot is really what turns this into random forest statistics: applying every function you can think of, and then gradually dropping the ones that make the result worse.
}}

==Explanation==

There are a number of different ways to identify the '{{w|average}}' value of a series of values. The most common unweighted methods are the {{w|median}} (take the central value from the ordered list of values if there are an odd number, or the value half-way between the two central values if there are an even number) and the {{w|arithmetic mean}} (add all the numbers up, then divide by the number of numbers). The {{w|geometric mean}} is less well-known but works similarly to the arithmetic mean: to take the geometric mean of 'n' values, they are multiplied together and then the 'n'th root is taken. For a list of identical values this returns that same value, just as the arithmetic mean does, but the two react differently once the values differ. Equivalently, the geometric mean can be found by taking the arithmetic mean of the logarithms of the values and then exponentiating the result.
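As a minimal sketch of these three calculations, the following Python snippet applies each method to the comic's input (1,1,2,3,5). It assumes Python 3.8 or later for the standard library's <code>statistics.fmean</code> and <code>statistics.geometric_mean</code>; the variable names are purely illustrative. It also shows the logarithm route to the geometric mean.

```python
import math
import statistics

data = [1, 1, 2, 3, 5]

arithmetic = statistics.fmean(data)          # (1 + 1 + 2 + 3 + 5) / 5
geometric = statistics.geometric_mean(data)  # fifth root of 1 * 1 * 2 * 3 * 5
median = statistics.median(data)             # middle value of the sorted list

# Same geometric mean via logarithms: average the logs, then exponentiate.
geometric_via_logs = math.exp(statistics.fmean([math.log(x) for x in data]))

print(arithmetic, geometric, median, geometric_via_logs)
```

These three results should match the F1 row of the table further down the page.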

The geometric mean, arithmetic mean and {{w|harmonic mean}} (not shown) are collectively known as the {{w|Pythagorean means}}. They are special cases of the more general {{w|generalized mean}} (or power mean) formula, which extends to arbitrarily many other kinds of mean (cubic, etc).
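To illustrate how the Pythagorean means fit into one family, here is a rough sketch of the generalised (power) mean in Python. The function name <code>power_mean</code> and its signature are assumptions made for this example, and positive inputs are assumed throughout. An exponent of 1 gives the arithmetic mean, -1 the harmonic mean, 3 the cubic mean, and the limiting case of 0 is handled separately as the geometric mean.

```python
import math


def power_mean(values, p):
    """Generalised mean with exponent p (hypothetical helper, positive inputs assumed)."""
    n = len(values)
    if p == 0:
        # The limit p -> 0 of the power mean is the geometric mean.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1 / p)


data = [1, 1, 2, 3, 5]
harmonic = power_mean(data, -1)
geometric = power_mean(data, 0)
arithmetic = power_mean(data, 1)
cubic = power_mean(data, 3)
print(harmonic, geometric, arithmetic, cubic)
```

For positive values that are not all equal, the results increase with the exponent, so the harmonic mean comes out smallest and the cubic mean largest here.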

{{w|Outlier}}s and internal biases within the original sample can leave any single 'average' overly influenced by flaws in the data. Depending on which method is chosen, the resulting value may be misleading, exaggerating or suppressing the significance of any blips.

<!-- Either here or after the next paragraph, demonstrate how (1,1,2,3,5) resolves in each individual method, perhaps? -->

In this depiction, the three named methods of averaging are embedded within a single function that produces a sequence of three values, one output for each of the methods. Since this output is itself a series of values, Randall suggests that it is ideally suited to being ''itself'' subjected to the same comparative 'averaging' method, not just once but as many times as it takes to narrow down to a sequence of three values that are very close to one another.

The comic's value of 2.089 for Gmdn(1,1,2,3,5) can be verified by applying F repeatedly:

{| class="wikitable"
|-
! !! Ave !! Geomean !! Median
|-
| F0 || colspan="3" | 1, 1, 2, 3, 5
|-
| F1 || 2.4 || 1.974350486 || 2
|-
| F2 || 2.124783495 || 2.116192461 || 2
|-
| F3 || 2.080325319 || 2.079536819 || 2.116192461
|-
| F4 || 2.0920182 || 2.091948605 || 2.080325319
|-
| F5 || 2.088097374 || 2.088090133 || 2.091948605
|-
| F6 || 2.089378704 || 2.089377914 || 2.088097374
|-
| F7 || 2.088951331 || 2.088951244 || 2.089377914
|-
| F8 || 2.089093496 || 2.089093487 || 2.088951331
|-
| F9 || 2.089046105 || 2.089046103 || 2.089093487
|-
| F10 || '''2.089061898''' || '''2.089061898''' || '''2.089046105'''
|}

The function Gmdn in the comic is not properly defined: F acts on a vector and produces a vector of three values, so repeated application of F always yields another 3-vector whose arithmetic mean, geometric mean and median can be computed again. Gmdn, however, is shown producing a single real number rather than a vector, so it is missing a final step that returns one of the components of the converged vector. In the table above, each row Fn consists of the arithmetic mean, geometric mean and median of the previous row, with the sequence {1,1,2,3,5} as the initial F0. Since all three are forms of averaging, each application of F acts as a smoothing step that pulls the values closer together, and the three components converge to a single value for any set of positive starting values. This can be likened to a heat equation approaching equilibrium.

The comment in the title text suggests that this procedure will save you the trouble of committing to the 'wrong' analysis, as it gradually shaves down any 'outlier average' that is unduly affected by anomalies in the original inputs. There is no danger of the values diverging, since all three averaging methods stay within the interval covering the input values (and, unless the inputs are all equal, the arithmetic and geometric means stay strictly within that interval).
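The claim that the values cannot diverge can be checked numerically. The sketch below is only an illustration: the helper name <code>f</code> and the choice of ten iterations are assumptions, and it relies on Python 3.8 or later for the <code>statistics</code> functions used. It confirms that every output of F lies inside the interval spanned by its inputs and records how the spread of the triple shrinks.

```python
import statistics


def f(values):
    """One application of F: (arithmetic mean, geometric mean, median)."""
    return (
        statistics.fmean(values),
        statistics.geometric_mean(values),
        statistics.median(values),
    )


values = [1, 1, 2, 3, 5]
spreads = []
for _ in range(10):
    lo, hi = min(values), max(values)
    values = f(values)
    # Every output lies inside the interval spanned by the inputs.
    assert all(lo <= v <= hi for v in values)
    # Record the width of the interval covering the new triple.
    spreads.append(max(values) - min(values))

print(spreads)
```

Because each new triple is confined to the interval of the previous one, the recorded spread can never grow from one step to the next.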

The title text may also be a sly reference to an actual mathematical theorem, namely that if one performs this procedure on two positive numbers using only the arithmetic mean and the harmonic mean, the result converges to their geometric mean. Randall suggests that the (non-Pythagorean) median, which does not have such good mathematical properties with relation to convergence, is, in fact, the secret sauce in his definition.
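The arithmetic-harmonic result can be sketched in a few lines of Python. The function name <code>arithmetic_harmonic_mean</code>, the tolerance and the sample inputs 1 and 5 are all assumptions chosen for illustration. The key observation is that the product of the two numbers is unchanged by each step, so the common limit must be the square root of that product, which is the geometric mean.

```python
import math


def arithmetic_harmonic_mean(a, b, tolerance=1e-12):
    """Iterate (arithmetic mean, harmonic mean) on two positive numbers."""
    while abs(a - b) > tolerance:
        # Note that the product a * b is the same before and after this step.
        a, b = (a + b) / 2, 2 * a * b / (a + b)
    return (a + b) / 2


result = arithmetic_harmonic_mean(1, 5)
print(result, math.sqrt(1 * 5))
```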

There does exist an {{w|arithmetic-geometric mean}}, which is defined in the same way but using only the arithmetic and geometric means, and which sees some use in calculus, for example in computing elliptic integrals. In some ways the geothmetic meandian is also philosophically similar to the {{w|truncated mean}} (the extremities of the value range, e.g. the highest and lowest 10% of values, are discarded and not counted) or the {{w|Winsorized mean}} (instead of being discarded, extreme values are replaced by the chosen floor or ceiling value that they lie beyond, so they still count as 'edge' conditions), only with a strange dilution-and-compromise method rather than one where quantities can be culled or neutered just for being unexpectedly different from most of the other data.
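For comparison, the truncated and Winsorized means can be sketched as follows. The function names and the parameter <code>k</code> (the number of values treated as extreme at each end, rather than a percentage) are assumptions made to keep the example short.

```python
import statistics


def truncated_mean(values, k):
    """Mean after discarding the k smallest and k largest values."""
    ordered = sorted(values)
    return statistics.fmean(ordered[k:len(ordered) - k])


def winsorized_mean(values, k):
    """Mean after clamping the k smallest and k largest values
    to the nearest value that is kept."""
    ordered = sorted(values)
    floor, ceiling = ordered[k], ordered[-k - 1]
    return statistics.fmean([min(max(v, floor), ceiling) for v in ordered])


data = [1, 1, 2, 3, 5]
print(statistics.fmean(data))    # plain arithmetic mean of all five values
print(truncated_mean(data, 1))   # mean of [1, 2, 3]
print(winsorized_mean(data, 1))  # mean of [1, 1, 2, 3, 3]
```

With the comic's input and one value trimmed at each end, both variants pull the result below the plain arithmetic mean, because the largest value (5) is either dropped or reduced to 3.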

The following Python code (inefficiently) implements the above algorithm:

<pre>
from functools import reduce


def f(*args):
    """Return (arithmetic mean, geometric mean, median) of the arguments."""
    args = sorted(args)
    n = len(args)
    mean = sum(args) / n
    gmean = reduce(lambda x, y: x * y, args) ** (1 / n)
    if n % 2:
        # Odd count: the median is the middle value.
        median = args[n // 2]
    else:
        # Even count: the median is the average of the two middle values.
        median = (args[n // 2] + args[n // 2 - 1]) / 2
    return mean, gmean, median


# Safety cap; large enough for the tolerance below to be reached first.
max_number_of_iterations = 100
tolerance = 0.00000001
l = [1, 1, 2, 3, 5]
for iterations in range(max_number_of_iterations):
    fst, *rest = l
    # Stop once every value agrees with the first to within the tolerance.
    if all(abs(r - fst) < tolerance for r in rest):
        break
    l = f(*l)
print(l[0], iterations)
</pre>

And here is an implementation of the Gmdn function in R:

    Gmdn <- function (..., threshold = 1E-6) {
      # Function F(x) as defined in comic
      f <- function (x) {
        n <- length(x)
        return(c(mean(x), prod(x)^(1/n), median(x)))
      }
      # Extract input vector from ... argument
      x <- c(...)
      # Iterate until the standard deviation of f(x) reaches a threshold
      while (sd(x) > threshold) x <- f(x)
      # Return the mean of the final triplet
      return(mean(x))
    }

The input sequence of numbers (1,1,2,3,5) chosen by Randall is also the opening of the {{w|Fibonacci sequence}}.  This may have been selected because the Fibonacci sequence also has a convergent property: the ratio of two adjacent numbers in the sequence approaches the [https://en.wikipedia.org/wiki/Golden_ratio#Relationship_to_Fibonacci_sequence golden ratio] as the length of the sequence approaches infinity.
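The convergence of the Fibonacci ratios is easy to check numerically. The sketch below is illustrative only; the choice of 40 steps is arbitrary.

```python
import math

golden_ratio = (1 + math.sqrt(5)) / 2

a, b = 1, 1
ratios = []
for _ in range(40):
    # Advance one step along the Fibonacci sequence.
    a, b = b, a + b
    ratios.append(b / a)

print(ratios[:5])
print(ratios[-1], golden_ratio)
```

The successive ratios alternate above and below the golden ratio while closing in on it.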

==Transcript==
{{incomplete transcript|Do NOT delete this tag too soon.}}

F(x1, x2, ... xn) = ( (x1 + x2 + ... + xn)/n [bracket: arithmetic mean], nth root of (x1 x2 ... xn) [bracket: geometric mean], x(n+1)/2 [bracket: median] )

Gmdn(x1,x2,...xn)={F(F(F(...F(x1,x2,...xn)...)))[bracket: geothmetic meandian]}

Gmdn(1,1,2,3,5) [equals about sign] 2.089

Caption: Stats tip: If you aren't sure whether to use the mean, median, or geometric mean, just calculate all three, then repeat until it converges

{{comic discussion}}
<!--
For a start, there is a syntax error. After the first application of F, you get a 3-tuple. Subsequent iterations preserve the 3-tuple, and we need to analyze the resulting sequence.
Perhaps there is an implicit claim all three entries converge to the same result. In any case, lets see what we get:

Wlog, we have three inputs (x_1,y_1,z_1), and want to understand the iterates of the map 
F(x,y,z) = ( (x+y+z)/3, cube root of (xyz), median(x,y,z) ). Lets write F(x_n,y_n,z_n) = (x_{n+1},y_{n+1},z_{n+1}).

The inequality of arithmetic and geometric means gives x_n \geq y_n, if n \geq 2,  and
-->

[[Category:Math]]
[[Category:Statistics]]