
Non-Parametric Hypothesis Testing for Comparing Machine Learning Algorithms
Modern statistics provides the basis for inference in most medical, industrial, and financial research. However, most machine learning practitioners and researchers are only familiar, to some degree, with descriptive statistical measures such as the mean, variance, correlation, and central tendency.
At the same time, many of them struggle with inferential statistics and hypothesis testing.
If you don’t know how to use statistical tests to compare machine learning methods, then this article is for you.
Why are statistical tests important, and why do we need them?
Generally, statistical tests are required to ensure that data are interpreted correctly and that apparent relationships are meaningful (or “significant”) and not simply chance occurrences.
More precisely, a statistical test provides a mechanism for making quantitative decisions about a process or many processes, where the objective is to determine whether there is enough evidence to “reject” a conjecture or hypothesis about the process. The conjecture is called the null hypothesis.
The concept of null hypothesis and alternative hypothesis
Here, we explain the concepts of null hypothesis and alternative hypothesis, with focus on machine learning applications.
In general, the null hypothesis proposes that there is no difference between the study groups (models or algorithms). On the contrary, the alternative hypothesis proposes that there is a significant difference between the study groups (models or algorithms).
Examples of Null Hypothesis and Alternative Hypothesis
- Accuracy Forecast
Null Hypothesis: Model X does not predict better than the existing model Y.
Alternative Hypothesis: Model X predicts better than the existing model Y.
- Recommendation Engine
Null Hypothesis: Algorithm B does not produce better recommendations than the current algorithm A being used.
Alternative Hypothesis: Algorithm B produces better recommendations than the current algorithm A being used.
- Regression Modeling
Null Hypothesis: This variable does not affect the outcome because its coefficient is zero.
Alternative Hypothesis: This variable affects the outcome because its coefficient is not zero.
Reject or Accept a Hypothesis
After properly defining H0 and H1, we apply a hypothesis test to the obtained results and compute a p-value (probability value), which quantifies the idea of statistical significance. We can then either reject or fail to reject the null hypothesis by comparing the p-value with a pre-defined significance threshold (alpha), as follows:
- If p-value > alpha: we fail to reject H0.
- If p-value < alpha: we have sufficient evidence to reject H0.
alpha is a pre-defined threshold value (the level of significance), which depends on the field of study.
Note: In a statistical test, the null hypothesis H0 is considered valid until enough data proves it wrong. If the data do not give us enough evidence to reject H0, this does not automatically prove that H0 is correct.
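As a minimal sketch, this decision rule can be expressed in Python as follows (the p_value and alpha values here are hypothetical placeholders, not the output of a real test):

```python
# Hypothetical p-value and alpha, to illustrate the decision rule only.
p_value = 0.03  # would come from a statistical test
alpha = 0.05    # pre-defined level of significance

if p_value < alpha:
    print("Sufficient evidence to reject H0.")
else:
    print("Fail to reject H0 (which does not prove H0 is correct).")
```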
Types of statistical tests
Parametric Test
A parametric test is a hypothesis test that provides generalizations for making statements about the mean of the population. It rests on the underlying assumption that the variable is normally distributed and that the mean is known or assumed to be known.
Non-parametric Test
Also known as a distribution-free test, a non-parametric test is a hypothesis test that is not based on underlying distributional assumptions, i.e. it does not require the population’s distribution to be characterized by specific parameters (it is typically based on differences in medians).
An exhaustive taxonomy of hypothesis testing methods is given below.

Non-Parametric Tests
According to the number of methods, algorithms, or classifiers being compared, these tests can be divided into two main categories. (Strictly speaking, the paired t-test and ANOVA listed below are parametric tests; they appear here as the classical counterparts of the non-parametric options in each category.)
Tests for comparing two classifiers over multiple data sets
- Paired t-test
- Wilcoxon Signed-Ranks Test
- Sign test: counts of wins, losses and ties
Tests for comparing multiple classifiers over multiple data sets
- ANOVA
- Friedman Test
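For reference, each of these tests has a ready-made implementation in scipy.stats. A minimal sketch follows; the accuracy arrays are hypothetical, and the sign test is approximated with a binomial test on the win counts (stats.binomtest assumes a recent SciPy):

```python
from scipy import stats

# Hypothetical per-data-set accuracy scores, for illustration only.
acc_a = [88.2, 90.1, 85.4, 91.0, 87.6]
acc_b = [86.5, 89.8, 84.9, 90.2, 88.0]
acc_c = [87.0, 88.5, 86.1, 89.9, 87.2]

# Two classifiers over multiple data sets.
t_stat, p = stats.ttest_rel(acc_a, acc_b)        # Paired t-test
w_stat, p = stats.wilcoxon(acc_a, acc_b)         # Wilcoxon Signed-Ranks Test
wins = sum(a > b for a, b in zip(acc_a, acc_b))  # Sign test: count wins of A over B
p_sign = stats.binomtest(wins, n=len(acc_a), p=0.5).pvalue  # ties would need to be dropped first

# Multiple classifiers over multiple data sets.
f_stat, p = stats.f_oneway(acc_a, acc_b, acc_c)         # ANOVA
chi2, p = stats.friedmanchisquare(acc_a, acc_b, acc_c)  # Friedman Test
```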
In this article, we will focus only on the Wilcoxon Signed-Ranks Test for comparing two classifiers over multiple data sets, due to its importance and usefulness.
Wilcoxon Signed-Ranks Test
When to use it?
- No assumption that the data are normally distributed.
- Used to compare two repeated measurements on a single sample (e.g., two models evaluated on the same data sets).
- The pairs of measurements are independent of one another.
State the Null Hypothesis and Alternative Hypothesis
- H0: the median difference between the paired measurements of the two classifiers is equal to zero.
- H1: the median difference between the paired measurements of the two classifiers is not equal to zero.
How does it work?
First, we compute a test statistic (T) and compare it with a critical value (T_crit), which is obtained from the signed-ranks table as follows:
- T_crit = Signed-Ranks-Table(#samples, alpha).
Then we decide whether to reject the null hypothesis:
- If T > T_crit: we fail to reject H0.
- If T < T_crit: we have sufficient evidence to reject H0.
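In practice, scipy.stats.wilcoxon computes this test statistic for you: for the default two-sided test, its statistic is exactly T = min(W+, W-), and it also returns a p-value that can replace the table lookup. A minimal sketch with hypothetical scores:

```python
from scipy import stats

# Hypothetical paired accuracy scores of two classifiers on eight data sets.
acc_a = [88.2, 90.1, 85.4, 91.0, 87.6, 89.3, 92.2, 86.8]
acc_b = [86.5, 89.8, 84.9, 90.2, 88.0, 88.1, 91.5, 87.4]

res = stats.wilcoxon(acc_a, acc_b)  # two-sided by default
print(res.statistic, res.pvalue)    # statistic is T = min(W+, W-)
```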
Application Example
Let’s consider that we want to choose between two machine learning models, Model A and Model B, according to their classification accuracy (%) on several test databases (benchmark sets). The objective is to decide which one will be deployed and used in the production environment.
First, we state our null hypothesis and alternative hypothesis as:
- H0: There is no difference between the two models A and B.
- H1: There is a difference between the two models A and B (the median difference is non-zero).
Here, we give the table of results for each model with respect to each of the nine test data sets.

Now, we start the Wilcoxon Signed-Ranks Test process (a scripted version of all the steps below is given after the conclusion).
- Step 1: Compute the differences between the two methods:
Diff = Model A - Model B.

- Step 2: Compute the sign of the differences (negative -1 or positive 1): Sign(Diff).

- Step 3: Compute the absolute value of the differences: |Diff|.

- Step 4: Rank the absolute values of differences.

The lowest value gets rank 1, the second lowest gets rank 2, the third lowest gets rank 3, and so on. In the case of ties, average ranks are assigned.
- Step 5: Compute the signed ranks as Rank * Sign(Diff).

- Step 6: Compute W+ as the sum of positive ranks, W- as the sum of negative ranks, and Test Statistic (T) as the minimum of |W+| and |W-|.

- Step 7: Extract the critical value T_crit from the signed-ranks table, for a significance level alpha = 0.05 and n = 8 non-zero differences (one of the nine differences is zero and is discarded, so n = 9 - 1 = 8).

According to the table, we get T_crit = 3.
- Step 8: Compare the test statistic (T) with the critical value T_crit.
T = 17 > T_crit = 3: we fail to reject H0.
From these test results, we conclude that there is not sufficient evidence to suggest a difference between the two models in terms of classification accuracy.
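Since the table above is shown as an image, here is a minimal scripted sketch of Steps 1 through 8. The model_a and model_b values are hypothetical, made up for illustration, and will not reproduce the article’s T = 17:

```python
import numpy as np
from scipy import stats

# Hypothetical accuracy (%) of the two models on nine test data sets;
# these values are illustrative only, not the article's original table.
model_a = np.array([88.2, 90.1, 85.4, 91.0, 87.6, 89.3, 92.2, 86.8, 90.5])
model_b = np.array([86.5, 89.8, 84.9, 90.2, 88.0, 88.1, 91.5, 87.4, 90.5])

# Step 1: differences between the two methods.
diff = model_a - model_b

# Zero differences carry no sign information and are discarded,
# which is how n can drop below the number of data sets (here 9 -> 8).
diff = diff[diff != 0]

# Steps 2-3: sign and absolute value of the differences.
sign = np.sign(diff)
abs_diff = np.abs(diff)

# Step 4: rank the absolute differences; ties receive the average rank.
ranks = stats.rankdata(abs_diff)

# Step 5: signed ranks.
signed_ranks = sign * ranks

# Step 6: W+, W-, and the test statistic T = min(|W+|, |W-|).
w_plus = signed_ranks[signed_ranks > 0].sum()
w_minus = signed_ranks[signed_ranks < 0].sum()
T = min(abs(w_plus), abs(w_minus))
print(f"W+ = {w_plus}, W- = {w_minus}, T = {T}")

# Steps 7-8 via the library: scipy.stats.wilcoxon returns the same T
# and a p-value, which replaces the signed-ranks table lookup.
res = stats.wilcoxon(model_a, model_b)
print(f"T = {res.statistic}, p-value = {res.pvalue:.3f}")
```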
What can we do to make the results significant?
Add more data sets (9 is very small), reconsider the level of significance (alpha), try or design a new method/model, perform optimal parameter selection, etc.
What’s next ?
I will discuss in my next articles:
- How to use the Friedman Test for comparing multiple classifiers over multiple data sets.
- How to use post-hoc tests to derive further evidence about the test results.