Hypothesis testing in Lean Six Sigma is how practitioners separate real differences from random noise. When you change a process and the average defect rate drops from 4.2 percent to 3.8 percent, is that a genuine improvement or just sampling variation? Hypothesis testing answers that question with a probability statement, and choosing the right test for the situation is the skill that separates a Black Belt from a beginner.
This article explains the logic of hypothesis testing, walks through the main test families and when to use each, and highlights the mistakes that cause projects to draw wrong conclusions from right-looking data.
The Logic of Hypothesis Testing
Every hypothesis test starts with two competing claims. The null hypothesis, written H0, is the default position: nothing has changed, the groups are the same, the factor has no effect. The alternative hypothesis, written H1 or Ha, is the position we are trying to find evidence for: the change worked, the groups differ, the factor matters.
The test produces a p-value, which is the probability of seeing data at least as extreme as what we observed if the null hypothesis were true. A small p-value (conventionally below 0.05) means the data are unlikely under the null, and we reject the null in favour of the alternative. A large p-value means we fail to reject the null, which is not the same as proving it true. The distinction matters, and blurring it is a frequent source of confusion in published analyses.
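To make this concrete, here is a minimal sketch in Python of the opening example: a defect rate that drops from 4.2 percent to 3.8 percent. The sample sizes of 1,000 units per period are an assumption for illustration, not figures from a real project.

```python
# Two-proportion z-test: did the defect rate genuinely drop?
# Sample sizes are illustrative assumptions, not real project data.
from statsmodels.stats.proportion import proportions_ztest

defects = [42, 38]      # defects before and after the change
units   = [1000, 1000]  # units inspected in each period (assumed)

z_stat, p_value = proportions_ztest(count=defects, nobs=units)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
# With these sample sizes the p-value comes out well above 0.05,
# so a drop of this size could easily be sampling variation.
```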
The ASQ overview of Six Sigma methodology places hypothesis testing firmly in the Analyse phase of DMAIC, where most of the inferential statistics live.
The Two Main Errors
Hypothesis testing can go wrong in two ways. A Type I error (alpha error) is rejecting the null when it is actually true: you conclude the change worked when it did not. The conventional significance level of 0.05 means we accept a 5 percent risk of this. A Type II error (beta error) is failing to reject the null when the alternative is actually true: you conclude the change did not work when it did. Power, the probability of correctly rejecting a false null, is one minus the Type II error rate.
In Lean Six Sigma projects, Type II errors are often the costlier mistake. A change that genuinely works gets discarded because the sample was too small to detect the effect. This is why power calculations and sample size planning are essential before any test is run.
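As a sketch of that planning step, statsmodels can solve for the sample size needed to detect a given standardised effect. The effect size, alpha, and power below are placeholder choices, not universal defaults.

```python
# Sample-size planning for a two-sample t-test.
# effect_size is Cohen's d; 0.5 (a "medium" effect) is an assumed target.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,   # smallest effect worth detecting (assumption)
    alpha=0.05,        # accepted Type I error risk
    power=0.80,        # 1 minus the accepted Type II error risk
    alternative="two-sided",
)
print(f"Need about {n_per_group:.0f} observations per group")  # roughly 64
```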
Choosing the Right Test
The decision tree for test selection depends on three questions. What type of data do you have (continuous or attribute)? How many groups are you comparing? What is the structure of your data (independent samples, paired samples, multiple factors)? Each test family below is followed by a short software sketch.
For continuous data
- One-sample t-test: compare a single sample mean to a target value.
- Two-sample t-test: compare the means of two independent groups, for example two suppliers.
- Paired t-test: compare two measurements on the same items, for example before-and-after on the same units.
- One-way ANOVA: compare the means of three or more groups.
- Two-way ANOVA: examine the effects of two factors on a continuous outcome, including their interaction.
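As a rough sketch, the continuous-data tests above map onto scipy calls like the following; the data are simulated stand-ins, not real measurements.

```python
# Continuous-data tests in scipy.stats (simulated data for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10.0, 2.0, 30)   # e.g. supplier A measurements
b = rng.normal(10.5, 2.0, 30)   # e.g. supplier B measurements
c = rng.normal(11.0, 2.0, 30)   # e.g. supplier C measurements

print(stats.ttest_1samp(a, popmean=10.0))  # one-sample t-test vs a target
print(stats.ttest_ind(a, b))               # two-sample t-test, independent groups
print(stats.ttest_rel(a, b))               # paired t-test (same units measured twice)
print(stats.f_oneway(a, b, c))             # one-way ANOVA, three groups
# Two-way ANOVA is usually run via statsmodels' formula interface,
# e.g. ols('y ~ C(factor1) * C(factor2)', data=df) followed by anova_lm.
```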
For attribute data
- One-proportion test: compare a single sample proportion to a target.
- Two-proportion test: compare proportions between two groups, for example pass rates between shifts.
- Chi-square test of independence: examine the relationship between two categorical variables.
- Chi-square goodness-of-fit test: examine whether observed frequencies match expected ones.
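A similar sketch for the attribute-data tests, again with invented counts:

```python
# Attribute-data tests (all counts below are invented for illustration).
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

# One-proportion test: 47 defects in 500 units vs a 10% target.
print(proportions_ztest(count=47, nobs=500, value=0.10))

# Two-proportion test: pass counts on two shifts.
print(proportions_ztest(count=[180, 160], nobs=[200, 200]))

# Chi-square test of independence on a 2x2 table (shift vs pass/fail).
table = [[180, 20], [160, 40]]
print(stats.chi2_contingency(table))

# Chi-square goodness-of-fit: observed vs expected frequencies.
print(stats.chisquare(f_obs=[18, 22, 30, 30], f_exp=[25, 25, 25, 25]))
```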
For non-normal continuous data
- Mann-Whitney U: non-parametric equivalent of the two-sample t-test.
- Wilcoxon signed-rank: non-parametric equivalent of the paired t-test.
- Kruskal-Wallis: non-parametric equivalent of one-way ANOVA.
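And the non-parametric equivalents, sketched on skewed simulated data:

```python
# Non-parametric tests for non-normal continuous data (simulated).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.lognormal(0.0, 0.5, 25)   # skewed sample 1
y = rng.lognormal(0.2, 0.5, 25)   # skewed sample 2
z = rng.lognormal(0.4, 0.5, 25)   # skewed sample 3

print(stats.mannwhitneyu(x, y))   # ~ two-sample t-test
print(stats.wilcoxon(x, y))       # ~ paired t-test (equal-length pairs)
print(stats.kruskal(x, y, z))     # ~ one-way ANOVA
```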
The Assumptions That Matter
Every parametric test has assumptions, and ignoring them invalidates the conclusion. The classic assumptions for t-tests and ANOVA are: data are continuous, the samples are independent, the population distributions are approximately normal, and the variances are roughly equal (homoscedasticity).
Normality is checked with a probability plot or an Anderson-Darling test. Equality of variances is checked with Bartlett's test if the data are normal, or Levene's test if they are not. Independence is checked by understanding how the data were collected, which is where many automated checks fail: no statistical test can tell you whether two observations are truly independent if the sampling scheme produced correlated readings.
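A sketch of those checks in scipy; the samples are simulated purely to show the calls.

```python
# Checking t-test/ANOVA assumptions (simulated data for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(10, 2, 40)
b = rng.normal(10, 2, 40)

# Anderson-Darling normality check: compare the statistic to the
# critical value at your chosen significance level.
result = stats.anderson(a, dist="norm")
print(result.statistic, result.critical_values)

# Equal-variance checks: Bartlett if the data are normal, Levene otherwise.
print(stats.bartlett(a, b))
print(stats.levene(a, b))
# Independence has no test here: it must come from the sampling plan.
```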
When assumptions fail, the choice is between transforming the data (a Box-Cox or log transformation often helps with skewness), using a non-parametric test, or, in advanced cases, using a bootstrap procedure. Pressing on regardless of a violated assumption is the worst option, though regrettably common.
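When the normality check fails, a Box-Cox transformation is one of the remedies just mentioned; scipy fits the transformation parameter automatically. Note the data must be strictly positive.

```python
# Box-Cox transformation for skewed, strictly positive data (simulated).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
skewed = rng.lognormal(0.0, 0.8, 50)   # right-skewed sample

transformed, fitted_lambda = stats.boxcox(skewed)
print(f"fitted lambda = {fitted_lambda:.2f}")
# Lognormal data should transform to roughly normal (lambda near 0,
# which corresponds to a log transform); re-check normality afterwards.
```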
Statistical Significance vs Practical Significance
A p-value below 0.05 is necessary but not sufficient for a project conclusion. Statistical significance means the observed difference is unlikely to be due to chance. Practical significance means the difference is large enough to matter to the business. With a large enough sample, almost any difference will be statistically significant, including ones too small to justify the cost of implementation.
Confidence intervals address this far better than p-values alone. A confidence interval on the difference between two means tells you both whether the difference is significant (the interval excludes zero) and how large the difference plausibly is. Reporting both the p-value and the confidence interval is the modern best practice.
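A sketch of that reporting style, computing the confidence interval on the difference between two simulated group means alongside the p-value; the statsmodels CompareMeans helper is one way to get the interval.

```python
# p-value plus confidence interval on a difference in means (simulated).
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import CompareMeans, DescrStatsW

rng = np.random.default_rng(5)
before = rng.normal(10.0, 2.0, 40)
after  = rng.normal(9.0, 2.0, 40)

t_stat, p_value = stats.ttest_ind(before, after, equal_var=False)  # Welch test
low, high = CompareMeans(DescrStatsW(before),
                         DescrStatsW(after)).tconfint_diff(usevar="unequal")
print(f"p = {p_value:.4f}, 95% CI on the difference: ({low:.2f}, {high:.2f})")
# If the interval excludes zero the difference is significant; its width
# tells you how large the difference plausibly is.
```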
For a worked example with Minitab screenshots, the Lean Sigma Corporation ANOVA guide walks through the full process step by step.
Hypothesis Testing Inside DMAIC
Most hypothesis tests are run in the Analyse phase, where the team is validating proposed root causes. A typical project will run several tests: comparing performance between shifts, between machines, between suppliers, between locations. The output is a short list of factors that demonstrably affect the outcome, ranked by effect size.
Hypothesis testing also appears in the Improve phase, when comparing pilot results to the baseline, and occasionally in the Control phase, when a special cause investigation needs formal evidence rather than visual inspection.
Common Mistakes
- Treating the p-value as the probability that the null is true. It is not. It is the probability of data at least as extreme as the observed data, given that the null is true.
- Running multiple tests on the same data without adjusting the significance level. The probability of at least one false positive grows quickly with the number of tests (a correction sketch follows this list).
- Skipping the assumption checks because the software did not complain. Software does not refuse to compute a t-test on non-normal data.
- Concluding ‘no effect’ from a non-significant result on a small sample. Absence of evidence is not evidence of absence.
- Forgetting to plan the sample size before collecting data.
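For the multiple-testing item above, here is a minimal Bonferroni correction sketch using statsmodels; the p-values are invented.

```python
# Bonferroni correction for a family of tests (p-values are invented).
from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.034, 0.049, 0.210, 0.003]   # from five separate tests
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05,
                                         method="bonferroni")
for p, p_adj, r in zip(p_values, p_adjusted, reject):
    print(f"raw p = {p:.3f} -> adjusted p = {p_adj:.3f}, reject H0: {r}")
# Note how 0.012, 0.034 and 0.049 no longer clear the 0.05 bar once five
# tests share the family-wise budget; only 0.003 survives.
```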
Black Belt candidates encounter all of these in formal training. The ILSSI Black Belt programme devotes substantial time to inferential statistics, including power analysis, multiple comparisons, and the design of experiments.
A Worked Example
A Green Belt project at a regional bank investigated whether a new training programme reduced data entry errors among call centre agents. The team measured error rates for 30 agents before training and again for the same agents after training. Because the data were paired (same agents, before and after), the appropriate test was a paired t-test, not a two-sample t-test.
The result was a p-value of 0.003 and a 95 percent confidence interval on the mean reduction in error rate of 0.8 to 2.4 percentage points. The interval excluded zero, confirming statistical significance, and the lower bound of 0.8 percentage points was judged practically significant by the project sponsor. The team progressed to the Improve and Control phases.
The deliberate choice of the paired test, rather than the two-sample test, made the analysis sensitive to the genuine effect. A two-sample test would have lost power by ignoring the agent-level pairing, and might have failed to reach significance even with the same underlying improvement.
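The effect of that choice shows up clearly in a simulation sketch. The numbers below are simulated to mimic the structure of the bank example (large agent-to-agent variation, a consistent per-agent improvement); they are not the project's data.

```python
# Paired vs two-sample t-test on simulated before/after data for 30 agents.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
before = rng.normal(10.0, 3.0, 30)               # error rates vary a lot by agent
after = before - 1.0 + rng.normal(0.0, 0.5, 30)  # consistent ~1-point improvement

print(stats.ttest_rel(before, after))   # paired: uses within-agent differences
print(stats.ttest_ind(before, after))   # unpaired: swamped by agent variation
# The paired test isolates the per-agent change, so its p-value is far
# smaller; the unpaired test treats agent-to-agent spread as noise and
# may miss the same improvement entirely.
```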
Final Thoughts
Hypothesis testing is the connective tissue between intuition and evidence in Lean Six Sigma. The mechanics are not difficult, and the software does the arithmetic. The judgment lives in choosing the right test, checking the assumptions, interpreting the result honestly, and translating the statistics back into language the sponsor can act on. That judgment takes practice, and there is no substitute for running real tests on real project data.
To deepen your statistical skills within an accredited framework, explore the ILSSI Green Belt certification and the range of ILSSI courses available globally.