Statistics Case Study

This is a case study of using χ² in an EDA context. See http://www.itl.nist.gov/div898/handbook/prc/section4/prc45.htm

Here's the background information from this example:

A total of 309 wafer defects were recorded and the defects were classified as being one of four types, A, B, C, or D. At the same time each wafer was identified according to the production shift in which it was manufactured, 1, 2, or 3.

We have a table of defects by shift and type. Each row is a different shift, each column is a different type.

defects = [
    [15, 21, 45, 13],
    [26, 31, 34,  5],
    [33, 17, 49, 20],
]

We can imagine this is the result of summarizing raw data. The raw data might look like this:

shift,defect,serial
1,None,12345
1,None,12346
1,A,12347
1,B,12348
etc. for thousands of wafers

The summary table is the result of something like a collections.Counter() using the (shift, defect) pair as the key.
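
Here's a sketch of that summarization. The file name wafers.csv and the column handling are illustrative assumptions, not part of the original example:

from collections import Counter
import csv

with open("wafers.csv") as source:
    rdr = csv.DictReader(source)
    # Count (shift, defect) pairs, skipping wafers with no defect.
    summary = Counter(
        (row["shift"], row["defect"])
        for row in rdr
        if row["defect"] != "None"
    )

The value of summary[("1", "A")] would be the count of type-A defects on shift 1.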

Or perhaps the results come from a SQL query like this:

SELECT SHIFT, DEFECT, COUNT(*)
FROM some tables
...
WHERE DEFECT IS NOT NULL
GROUP BY SHIFT, DEFECT;

Before starting the analysis, we'll need to import some libraries:

from chi_sq import cdf
from fractions import Fraction

Basic EDA Analysis

We need to see whether the effects are random. If the effects are just random variation, our hypothesis is that there's nothing to see here. This is the statistician's "null hypothesis": nothing interesting is happening.

To evaluate the null hypothesis, we compare the observed defect data against some reasonable expectations for the defect data.

What we'll do is determine a hypothetical allocation of the same total number of defects, distributed according to the overall shift and type totals.

The total number of defects is $\sum_s \sum_t d_{s,t} = 309$.

total = sum(map(sum, defects))

We need to compute the expected defects by shift and type. The null hypothesis asserts that all effects are random, so the defects should be spread across shift and type in proportion to the overall totals.

Defects By Shift

The shift totals are defined as $\left\{ \sum_t d_{s,t} \mid 0 \le s < 3 \right\}$.

shift_total = [sum(defects[s][t] for t in range(4)) for s in range(3)]

The total numbers of defects by shift are [94, 96, 119].

We can also calculate the totals like this, but that approach doesn't generalize well for computing sums across types.

list(map(sum, defects))

Here are the probabilities of a defect based on actual counts of defects by shift.

P_shift = [Fraction(s,total) for s in shift_total]

The values are [Fraction(94, 309), Fraction(32, 103), Fraction(119, 309)].

We've materialized two list objects here. Since we'll be producing some intermediate output, the materialized collections are helpful. If we did not intend to produce intermediate output, we could use lazy generator functions to reduce the amount of memory required.
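
For example, a fully lazy alternative might look like this sketch; the _lazy names are hypothetical:

# Generator expressions: nothing is computed until a consumer
# iterates, and each pipeline can be consumed only once.
shift_total_lazy = (sum(row) for row in defects)
P_shift_lazy = (Fraction(s, total) for s in shift_total_lazy)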

Defects By Type

Computing the type totals isn't quite so simple as computing the shift totals because we're slicing the matrix on an axis that doesn't work trivially with the sum() function.

Here are the type totals: $\left\{ \sum_s d_{s,t} \mid 0 \le t < 4 \right\}$.

type_total = [sum(shift[t] for shift in defects) for t in range(4)]

The values are [74, 69, 128, 38].
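
As an aside, transposing the matrix with zip() lets the map(sum, ...) form shown earlier work on the type axis, too:

# zip(*defects) transposes the rows-by-shift matrix into
# columns-by-type, so sum() is applied to each type's counts.
type_total = list(map(sum, zip(*defects)))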

Here are the probabilities of a defect based on actual counts of defects by type.

P_type = [Fraction(t,total) for t in type_total]

The values are [Fraction(74, 309), Fraction(23, 103), Fraction(128, 309), Fraction(38, 309)].

Combined Expectations

The overall defect probabilities are $P_{s,t} = P_s \times P_t$ for each shift, $s$, and type, $t$. This reflects the null hypothesis that shift and type of defect are independent of each other: any interaction between them is just random.

We can multiply the probability by the total number of defects to compute how many defects belong to each combination of shift and type.

expected = [
    [float(ps*pt)*total for pt in P_type]
    for ps in P_shift
]

We can format these results like this to make something understandable:

r2 = lambda x:round(x,2)

Here's a list-of-lists summary:

list(list(map(r2, row)) for row in expected)

This structure shows the following expected values for defect counts:

[
   [22.51, 20.99, 38.94, 11.56],
   [22.99, 21.44, 39.77, 11.81],
   [28.5, 26.57, 49.29, 14.63]
]

Each cell combines the shift's contribution and the type's contribution to the overall number of defects.

Note the important difference from a naive expectation that the number of defects would be 309/12 = 25.75 in all cases. We're not asserting that shift and type of defect have no effect. We're asserting that the effects are independent of each other.
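
For example, the expected count for shift 1 and defect type A is $E_{1,A} = \frac{94 \times 74}{309} \approx 22.51$: the shift 1 total times the type A total, divided by the grand total.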

For example, the different shifts may have different numbers of workers. It might be perfectly reasonable for one shift to produce more wafers and therefore more defects. Using actual shift totals to compute the probabilities reflects the idea that the observed counts will vary.

Displaying the Contingency Table

We can display the observed and expected values in a single table. This involves a bit of restructuring of the data to show defects and expected values side by side.

print("obs exp    "*4)
for s in range(3):
    pairs = '  '.join(
        f"{defects[s][t]:3d} {expected[s][t]:5.2f}" for t in range(4))
    print(f"{pairs}  {shift_total[s]:3d}")
footer = '        '.join(f"{type_total[t]:3d}" for t in range(4))
print(f"{footer}        {total:3d}")

For each of the three shifts, we produced a row of data. Each row contains pairs of observed and expected values organized by defect type.

The output looks like this:

obs exp    obs exp    obs exp    obs exp
 15 22.51   21 20.99   45 38.94   13 11.56   94
 26 22.99   31 21.44   34 39.77    5 11.81   96
 33 28.50   17 26.57   49 49.29   20 14.63  119
 74         69        128         38        309

This shows the observed defects and the expected defects. It shows the shift totals and the defect type totals, also.

In many practical applications, producing a text report like this isn't really optimal. It's often nicer to write a CSV file that can be reformatted for display. Tools like pandas with an xlsx writer can provide more useful output.
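
Here's a minimal sketch of that idea, assuming pandas (and an Excel writer such as openpyxl) is installed; the DataFrame layout shown is just one of many possibilities:

import pandas as pd

# Rows are shifts 1-3; columns are defect types A-D.
df = pd.DataFrame(defects, index=[1, 2, 3], columns=["A", "B", "C", "D"])
df.to_excel("contingency.xlsx")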

Here's the CSV version of the output. It requires a function to properly flatten and interleave defects and expected values.

def flatten(defects, expected=None):
    """Interleave items of defects with items of expected, or None."""
    for t in range(4):
        yield defects[t]
        if expected:
            yield expected[t]
        else:
            yield None

This is similar in philosophy to the itertools.zip_longest() function. It interleaves values from defects and an optional second iterable. It can also be used to emit headers and footers; these situations don't have both iterables.
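
For comparison, a roughly equivalent interleave can be built from itertools primitives; this sketch is not part of the original design:

from itertools import chain, repeat

# Pair each type total with None, then flatten the pairs.
# This produces the same sequence as list(flatten(type_total)).
list(chain.from_iterable(zip(type_total, repeat(None))))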

import csv

with open("contigency.csv", "w", newline="") as output:
    wtr=csv.writer(output)
    wtr.writerow(["shift"]+list(flatten(("A","B","C","D")))+["total"])
    wtr.writerow(["","obs","exp","obs","exp","obs","exp","obs","exp"])
    for s in range(3):
        row= [s]+list(flatten(defects[s],expected[s]))+[shift_total[s]]
        wtr.writerow(row)
    row= ["total"]+list(flatten(type_total))+[total]
    wtr.writerow(row)

We've opened a writer and put out two lines of titles as a heading. The list(flatten(("A","B","C","D"))) produces eight values by interleaving the four defect types and an equal number of None values.

The body includes the interleaved defect counts and expected counts. Each row has a shift number as a header and a shift total as a summary.

The total line uses list(flatten(type_total)) to interleave the defect type totals with None values to create a footer line.

Based on the totals, the third shift is expected to produce more defects than the first two. Similarly, defect type "C" is expected to show up considerably more often than defect type "D".

The question is "do the cell details reflect the overall summaries?"

Applying the χ² test

The final χ² value is $\chi^2 = \sum_s \sum_t \frac{(E_{s,t} - d_{s,t})^2}{E_{s,t}}$.

We have three broad design patterns for working with parallel nested structures:

  1. Sum of sums. This involves a sum(map(sum, iterable)) expression. Since we have two structures that must be compared, this is really a bit more complex than it appears. We need to compute the differences before summing; see the sketch after the examples below.
  2. Use all combinations of index values.
  3. Flatten both expected and actual and use a single sum.

The second approach is how we produced the contingency table, above.

We'll look at the third approach, also.

Both will benefit from a handy lambda that allows us to compute the squared difference value between expected, e, and observed, o.

diff = lambda e,o: (e-o)**2/e

We can use this lambda with all combinations of indices like this.

chi2 = sum(
    diff(expected[s][t], defects[s][t])
    for s in range(3)
    for t in range(4)
)

We can also apply this lambda to two flattened sequences like this:

chi2 = sum(
    map(diff,
        (e for shift in expected for e in shift),
        (o for shift in defects for o in shift),
    )
)
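
For completeness, here's a sketch of the first approach from the list above, sum of sums. The zip() function pairs the two nested structures so the differences can be computed before the row sums and the final sum:

chi2 = sum(
    sum(map(diff, e_row, o_row))
    for e_row, o_row in zip(expected, defects)
)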

Either version gets us a χ² value, chi2, of 19.18. There are six degrees of freedom in this model: (3-1) = 2 shifts times (4-1) = 3 types.

print(f"χ² = {chi2:.2f}, P = {cdf(chi2, 6):.5f}")

The output looks like this:

χ² = 19.18, P = 0.00387

The probability that data like this arises from random variation alone is very low. There's something going on here that deserves further investigation. In formal terms, we must reject the null hypothesis.

Functional Python Notes

Generally, each step has been defined by functional programming design patterns. We've accumulated totals, transformed them to probabilities, and then transformed the probabilities into expected values. We computed the χ² and compared this with the χ² cumulative distribution function to measure the randomness of the data.

Each stage has been done with generator expressions and higher-order functions.

In order to produce a printed report or CSV output file, we had to step back from purely functional programming. In these two cases, we used Python imperative programming techniques to structure the output in the intended format.