χ² Calculation

The χ² table in a statistics book enumerates values of the χ² Cumulative Distribution Function, CDF.

F(x;k) = (γ(k)/(2), (x)/(2))/(Γ(k)/(2))

See http://en.wikipedia.org/wiki/Chi-squared_distribution

Given a sum-of-squares value, χ2, and degrees of freedom, f. We can compute p = 1 − F2;k) which is the probability of the χ2 being random. A value of p ≤ 0.05 means the data is unlikely to be random. A value of p > 0.05 means that the null hypothesis is probably true: the data is random. The higher the p value, the more likely that the null hypothesis is true.

This document presents the computation of the the CDF, F(x;k). It shows a number of approaches to calculations of the two required values, γ(s, z), and Γ(t). A final choice for a useful complete gamma function is chosen based on accuracy for the given use case.

See http://en.wikipedia.org/wiki/Incomplete_gamma_function#Regularized_Gamma_functions_and_Poisson_random_variables

See http://en.wikipedia.org/wiki/Stirling%27s_approximation

See http://dlmf.nist.gov/5 and http://dlmf.nist.gov/8

See http://netlib.org/ Specifically module 542

Imports

Modules required by this module.

import operator
from functools import reduce, lru_cache
from fractions import Fraction
import math
from typing import Iterator, Tuple, Callable, Iterable, TypeVar

Factorial

Simple integer factorial is used in a variety of places for computing the incomplete and complete gamma function values.

We'll use this definition of factorial:

n! = ni = 1i
@lru_cache(128)
def fact(k: int) -> int:
    """Simple factorial.

    >>> fact(1)
    1
    >>> fact(2)
    2
    >>> fact(3)
    6
    >>> fact(4)
    24
    """
    if k < 2:
        return 1
    return reduce(operator.mul, range(2,k+1))

The implementation uses reduce( operator.mul, ... ) to compute the product of a sequence of integer values. We've included the @lru_cache because this is used often, and the small domain of possible values leads to some benefit from the cache.

We could also use math.factorial(). In order to make use of the cache, we would need to do something like this.

fact = lru_cache(128)(math.factorial)

This would create a similarly cached factorial function.

Incomplete Gamma

The incomplete (lower) gamma function is this:

γ(s, z) = k = 0(( − 1)k)/(k!)(zs + k)/(s + k)
def gamma(s: float, z: float) -> float:
    """Incomplete gamma function.

    >>> import math
    >>> round(gamma(1, 2),7)
    0.8646647
    >>> round(1-math.exp(-2),7)
    0.8646647
    >>> round(gamma(1, 3),7)
    0.9502129
    >>> round(1-math.exp(-3),7)
    0.9502129
    >>> round(gamma(0.5, 2),7)
    1.6918067
    >>> round(math.sqrt(math.pi)*math.erf(math.sqrt(2)),7)
    1.6918067
    """

    def terms(s: float, z: float) -> Iterator[float]:
        for k in range(1000):
            term = ((-1)**k/fact(k))*(z**(s+k)/(s+k))
            yield term

    T_ = TypeVar("T_")
    def take_until(function: Callable[[T_], bool], source: Iterable[T_]) -> Iterator[T_]:
        for v in source:
            if function(v): return
            yield v
    ε = 1E-8
    return sum(take_until(lambda t: abs(t) < ε, terms(s, z)))

The idea here is to compute an infinite sequence of values for (( − 1)k)/(k!)(zs + k)/(s + k). We enumerate these values while they're greater than ϵ.

We wrote our own filter function take_until(end_condition, function). This stops the infinite generation of terms when the end condition is met.

One alternative to this is a single tail-recursion, which we can optimize to create a simple for loop which emits terms until the values are too small to be relevant.

sigma = 0
for k in range(1000):
    term = ((-1)**k/fact(k))*(z**(s+k)/(s+k))
    if abs(term) < ε: break
    sigma += term
return sigma

We could futher optimize this function to use stateful internal variables in the for loop. We could slightly improve performance with two changes.

The (-1)**k expression flips the sign on each term. We could use, for example, 1 if k%2 == 0 else -1, which would somewhat less expensive to evaluate.

Similarly, we don't really need to compute k! for increasing values of k. We could, instead, keep a stateful running product calculation that we simply multiply by the next value of k.

Gamma Function 1

The various complete gamma function approxiations are all slightly wrong in various places. We've done three implementations in an attempt to locate a "best" function.

The math.gamma() function is available in Python 3.2. We can compare this with our requirements, also.

The complete gamma function has this definition:

Γ(t) = (1)/(t)k = 1(1 + (1)/(k)t)/(1 + (t)/(k))
def Gamma1(t: float) -> float:
    """Gamma Function.

    Gamma(n) == fact(n-1)

    >>> import math
    >>> round(Gamma1(2),0)
    1.0
    >>> round(Gamma1(3),0)
    2.0
    >>> round(Gamma1(4),0)
    6.0
    >>> round(Gamma1(5),0)
    24.0
    >>> round(Gamma1(.5), 7) # Not quite right
    1.7726754
    >>> round(math.sqrt(math.pi), 7)
    1.7724539
    """

    def num_den(t: float) -> Iterator[Tuple[float, float]]:
        for k in range(1, 1000):
            yield (1+1/k)**t, (1+t/k)

    T_ = TypeVar("T_")
    def take_until_star(
            function: Callable[[T_, T_], bool],
            source: Iterable[Tuple[T_, ...]]
        ) -> Iterable[Tuple[T_, ...]]:
        for v in source:
            if function(*v): return
            yield v

    prod = lambda x: reduce(operator.mul, x)
    fst = lambda x: x[0]
    snd = lambda x: x[1]
    ε = 1E-8

    terms = tuple(take_until_star(lambda n, d: abs(n/d-1) < ε, num_den(t)))

    return prod(map(fst, terms))/(t*prod(map(snd, terms)))

This involves two products: the numerator product and the denominator product. The num_den() function emits an sequence of pairs, (n, d).

We could consider this as a sequence of pairs like this:

1 + (1)/(k)t, 1 + (t)/(k) for1 ≤ k < ∞

We wrote our own filter function take_until_star(end_condition, function). This stops the infinite generation of tuples when the end condition for the tuple is met. We can't use the simple take_util() defined above because it makes working with tuples a little bit clunky. This version works more elegantly with multi-valued tuples.

If the (1 + (1)/(k)t)/(1 + (t)/(k)) fraction is close to 1, we can stop taking values from the infinite iterator. We'll save this sequence in a materialized object, terms, because we need to do two reductions on the sequence.

We then split the two values in the terms sequence using fst() and snd() functions. This allows us to compute the numerator and denominator products separately. We defer doing the final division to the very end to preserve as many bits of accuracy as possible.

Here's an alternative loop construction.

p_num = p_den = 1
for n in range(1,1000):
    num, den = (1+1/n)**t, (1+t/n)
    if abs(num/den-1) < ε: break
    p_num *= num
    p_den *= den
return p_num/(t*p_den)

Note that the value of Γ(1)/(2) is very close to the defined value of (π). The integer values, however, require rounding to zero places to show the expected values.

Here's a second version using Fraction objects instead of floats. The idea here is to compute the exact value to see if -- perhaps -- the slight discrepancies between actual and expected are due to float issues.

This only works for int and Fraction values. It doesn't work for arbitrary float values. That's not a big limitation for this application.

def Gamma1f(t: float) -> float:
    """Gamma Function.

    Gamma(n) == fact(n-1)

    >>> import math
    >>> round(Gamma1f(2),0)
    Fraction(1, 1)
    >>> round(Gamma1f(3),0)
    Fraction(2, 1)
    >>> round(Gamma1f(4),0)
    Fraction(6, 1)
    >>> round(Gamma1f(5),0)
    Fraction(24, 1)
    >>> round(Gamma1f(Fraction(1,2)), 7) # Not quite right
    1.7726754
    >>> round(math.sqrt(math.pi), 7)
    1.7724539
    """

    def num_den(t: Fraction) -> Iterator[Tuple[Fraction, float]]:
        for k in range(1, 1000):
            yield (1+Fraction(1, k))**t, (1+t/k)

    T_ = TypeVar("T_")
    def take_until_star(
            function: Callable[[T_, T_], bool],
            source: Iterable[Tuple[T_, ...]]
        ) -> Iterable[Tuple[T_, ...]]:
        for v in source:
            if function(*v): return
            yield v

    prod = lambda x: reduce(operator.mul, x)
    fst = lambda x: x[0]
    snd = lambda x: x[1]
    ε = 1E-8

    t_f = Fraction(t)
    terms = tuple(take_until_star(lambda n, d: abs(n/d-1) < ε, num_den(t_f)))

    return prod(map(fst, terms))/(t_f*prod(map(snd, terms)))

We've replaced the division operation in the num_den() function with Fraction(). We've also replaced the argument value, t, with a Fraction, t_f.

Two other divisions were left in place because the arguments would be Fraction instances:

  • In the num_den() function, we left a division because the argument, t will be a Fraction.
  • The final division (betweem two Fraction objects) is left in place.

Interestingly, this produces essentially the same results as the previous version. This, too, is not very accurate for values near :math:\frac{1}{2}.

Gamma Function 2

It's not clear that we can do better than gamma1(), but it's worth looking at other alternatives. The Nemes approximation has the advantage of brevity.

Γ(z) ~ ((2π)/(z))(1)/(e)z + (1)/(12z − (1)/(10z))z
def Gamma2(z: float) -> float:
    """Gamma Function. Gergő Nemes version.

    Gamma(n) == fact(n-1)

    >>> import math
    >>> round(Gamma2(2),1)
    1.0
    >>> round(Gamma2(3),1)
    2.0
    >>> round(Gamma2(4),1)
    6.0
    >>> round(Gamma2(5),1)
    24.0
    >>> round(Gamma2(.5), 7) # Not quite right
    1.7630962
    >>> round(math.sqrt(math.pi), 7)
    1.7724539
    """
    t_1 = math.sqrt(2*math.pi/z)
    t_2 = (z+(1/(12*z-(1/(10*z)))))/math.e
    return t_1*t_2**z

We broke the expression into two parts just to shorten up the expression.

t1  = ((2π)/(z)) t2  = (z + (1)/(12z − (1)/(10z)))/(e) Γ(z)  = t1 × t2z

This is a simple, closed-form evaluation of a function. It's remarkably inaccurate for Γ(1)/(2). For integer values, however, it's very good.

Gamma Function 3

Here's the Sterling version. This sums a sequence of values.

Γ(z) ~ e − zzz((2π)/(z))k = 0(gk)/(zk)

Here is the beginning of the sequence of gk values.

g0  = 1,  g1  = 1 ⁄ 12,  g2  = 1 ⁄ 288,  g3  =  − 139 ⁄ 51840,  g4  =  − 571 ⁄ 24 88320,  g5  = 1 63879 ⁄ 2090 18880,  g6  = 52 46819 ⁄ 7 52467 96800.

There's a defined rule for calculating these obscure-looking values. However, we note that the last of these is near 10E-5, this is enough to give us a usefully accurate answer.

This is not a convergent series: using more terms used may actually reduce the accuracy of the approximation. For the values we're working with, we don't seem to need all six terms shown above.

def Gamma3(z: float) -> float:
    """Gamma Function. Sterling version.

    http://dlmf.nist.gov/5.11#E3

    Gamma(n) == fact(n-1)

    >>> import math
    >>> round(Gamma3(2),1)
    1.0
    >>> round(Gamma3(3),1)
    2.0
    >>> round(Gamma3(4),1)
    6.0
    >>> round(Gamma3(5),1)
    24.0
    >>> round(Gamma3(.5), 7)
    1.7737381
    >>> round(math.sqrt(math.pi), 7)
    1.7724539
    """
    t_1 = math.exp(-z)*z**z
    t_2 = math.sqrt(2*math.pi/z)
    g = [1, 1/12, 1/288, -139/51840, -571/2488320, 163879/209018880, 5246819/75246796800]
    t_3 = sum(g[k]/(z**k) for k in range(2))
    return t_1*t_2*t_3

We've broken this into three parts, to reduce the size of the overall expression.

t1  = e − zzz t2  = ((2π)/(z)) t3  = k = 0(gk)/(zk) Γ(z)  = t1 × t2 × t3

This is acceptable for Γ(1)/(2). Note that we're only using the first two terms. Compare these results with the Gamma2() function, shown above. The number of terms required seems to vary with the range of the vaue of z.

Gamma Hybrid

Most of the approximations aren't very accurate for the Γ(k)/(2) values. We can use this exact closed-form expression instead:

Γ(1)/(2) + n = ((2n)!)/(4nn!)(π)

This provides precise values for the special cases we're using.

Here's a hybrid Gamma() function. For certain values, we'll use the exact Γ(1)/(2) + n value. For other values, we'll use the Gamma2() approximation, above.

def Gamma_Half(k: float) -> float:
    """Gamma(k) with special case for k = n+1/2; k-1/2=n.

    >>> import math
    >>> round(Gamma_Half(2),1)
    1.0
    >>> round(Gamma_Half(3),1)
    2.0
    >>> round(Gamma_Half(4),1)
    6.0
    >>> round(Gamma_Half(5),1)
    24.0

    >>> round(Gamma_Half(.5), 7)
    1.7724539
    >>> round(math.sqrt(math.pi), 7)
    1.7724539
    >>> round(Gamma_Half(1.5), 7)
    0.8862269
    >>> round(math.sqrt(math.pi)/2, 7)
    0.8862269
    """
    ε = 1E-6
    if abs(k-int(k)-.5) < ε:
        n = int(k-.5)
        return fact(2*n)/(4**n*fact(n))*math.sqrt(math.pi)
    else:
        return float(Gamma2(k))

If the value is an n + (1)/(2)±ϵ, we'll use the special close-form expression. If the value is not close to n + (1)/(2), we'll use a more general approximation.

The math.gamma() Version

Here's a test case for the math.gamma() function.

test_math_gamma = """
>>> import math
>>> round(math.gamma(2),2)
1.0
>>> round(math.gamma(3),2)
2.0
>>> round(math.gamma(4),2)
6.0
>>> round(math.gamma(5),2)
24.0

>>> round(math.gamma(.5), 7)
1.7724539
>>> round(math.sqrt(math.pi), 7)
1.7724539
>>> round(math.gamma(1.5), 7)
0.8862269
>>> round(math.sqrt(math.pi)/2, 7)
0.8862269
"""

This looks good, also. For the given test cases, it's as accurate as our hybrid shown above.

Cumulative Distribution Function

The real calculation of the CDF from a χ2 value, x, and the degrees of freedom, k.

def cdf(x: float, k: int) -> float:
    """χ² cumulative distribution function.

    :param x: χ² value -- generally sum (obs[i]-exp[i])**2/exp[i]
        for parallel sequences of observed and expected values.
    :param k: degrees of freedom >= 1; generally len(data)-1

    From http://en.wikipedia.org/wiki/Chi-squared_distribution

    >>> round(cdf(0.004, 1), 2)
    0.95
    >>> round(cdf(10.83, 1), 3)
    0.001
    >>> round(cdf(3.94, 10), 2)
    0.95
    >>> round(cdf(29.59, 10), 3)
    0.001
    >>> expected=[0.95, 0.90, 0.80, 0.70, 0.50, 0.30, 0.20, 0.10, 0.05, 0.01, 0.001]
    >>> chi2= [0.004, 0.02, 0.06, 0.15, 0.46, 1.07, 1.64, 2.71, 3.84, 6.64, 10.83]
    >>> act= [round(x,3) for x in map(cdf, chi2, [1]*len(chi2))]
    >>> act
    [0.95, 0.888, 0.806, 0.699, 0.498, 0.301, 0.2, 0.1, 0.05, 0.01, 0.001]

    From http://www.itl.nist.gov/div898/handbook/prc/section4/prc45.htm

    >>> round(cdf(19.18, 6), 5)
    0.00387
    >>> round(cdf(12.5916, 6), 2)
    0.05

    From http://www.itl.nist.gov/div898/handbook/prc/section4/prc46.htm
    >>> round(cdf(12.131, 4), 5)
    0.01639
    >>> round(cdf(9.488, 4), 2)
    0.05

    """
    return 1-gamma(k/2, x/2)/Gamma_Half(k/2)

The calcuation is 1 minus the ratio the partial gamma to the full gamma.

Unit Test Cases

We'll use the doctest comments in each function defined above. Additionally, we'll use some strings with doctest test cases.

__test__ = {
    "test_math_gamma": test_math_gamma,
}

def test(*args, **kw):
    import doctest
    doctest.testmod(*args, **kw)

if __name__ == "__main__":
    test()