Data Complexity Measures

rachid benouini
Jul 3, 2021

Introduction

It is well known that different classifiers perform differently on different datasets, because the behavior of a classifier strongly depends on the available data and on its intrinsic characteristics.

For a given problem, it is very difficult to guess in advance which classifier will provide the best performance. The common practice is therefore to evaluate and compare the accuracy of several classification algorithms, then choose one or a set of them according to the experiment, without analyzing why one classifier outperforms the other algorithms on this specific problem.

Meanwhile, it is widely accepted that any machine learning algorithm can be affected by different properties of the data, such as the shapes of the classes, the shape of the decision boundary, the amount of overlap among classes, the proximity among classes, etc.

Therefore, it is important to study the characteristics of the data set, as they have a high impact on the performance of any classification algorithm and on the choice of its parameters.

Data complexity analysis is a set of tools and techniques that aims to measure the aforementioned factors. This kind of information about data complexity can be very valuable when designing machine learning solutions, and it can be used to support the selection of the proper classification algorithm.

However, the practice of data complexity analysis is often ignored.

If you don’t know how to perform data complexity analysis, then this article is for you.

Data Complexity Measures

The general idea consists of predicting the applicability and performance of a classifier based on certain data characteristics. To this end, one could employ a set of data complexity measures, usually based on statistical, geometrical, and information-theoretic descriptions.

These metrics describe the regularities and irregularities contained in the data set.

The measures considered can be divided into three categories as follows:

  • Measures of overlaps in feature values from different classes
  • Measures of separability of classes
  • Measures of geometry, topology, and density of manifolds

Measures of overlaps in feature values from different classes

These measures mainly focus on the effectiveness of a single feature dimension in separating the classes. They examine the range and spread of values in the data set with respect to each feature, and check for overlaps among different classes.

Maximum Fisher’s discriminant ratio

It compares the squared difference between the class means with the sum of the class variances, feature by feature, and keeps the best ratio.
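To make this concrete, here is a minimal NumPy sketch of the measure for a two-class data set (the function name and the toy data are my own; library implementations may normalize differently):

```python
import numpy as np

def max_fisher_ratio(X, y):
    """Maximum Fisher's discriminant ratio for a two-class dataset:
    for each feature, the squared difference of the class means divided
    by the sum of the class variances; the largest per-feature ratio
    is returned. Large values indicate an easy, well-separated problem."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    c0, c1 = np.unique(y)
    A, B = X[y == c0], X[y == c1]
    ratios = (A.mean(axis=0) - B.mean(axis=0)) ** 2 / (A.var(axis=0) + B.var(axis=0))
    return float(ratios.max())

# Two well-separated Gaussian blobs give a large ratio.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(max_fisher_ratio(X, y))
```

The ratio grows with the distance between the class means and shrinks as the classes spread out, which is exactly the intuition behind the measure.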

Volume of overlap region

Computed from the maximum and minimum values of each feature in each class, from which the length of the overlap region between the classes is derived.
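A minimal sketch of this measure for two classes might look as follows (the function name is my own; a common convention, used here, multiplies the normalized per-feature overlaps together):

```python
import numpy as np

def overlap_volume(X, y):
    """Volume of the overlap region, sketched for two classes: per
    feature, the length of the interval where the two class ranges
    overlap, normalised by the total range; the per-feature fractions
    are multiplied, so 0 means at least one feature has no overlap."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    c0, c1 = np.unique(y)
    A, B = X[y == c0], X[y == c1]
    lo = np.maximum(A.min(axis=0), B.min(axis=0))   # overlap starts here
    hi = np.minimum(A.max(axis=0), B.max(axis=0))   # ... and ends here
    span = np.maximum(A.max(axis=0), B.max(axis=0)) - np.minimum(A.min(axis=0), B.min(axis=0))
    return float(np.prod(np.clip(hi - lo, 0.0, None) / span))
```

For fully disjoint per-feature ranges the measure is 0; for identical ranges it is 1.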

Maximal (individual) feature efficiency

A measure of how much each individual feature contributes to the separation of the classes.
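One common way to compute this, sketched below with a function name of my own, is to count, per feature, the fraction of points lying outside the region where the two class ranges overlap, i.e. points a single threshold on that feature would already separate:

```python
import numpy as np

def max_feature_efficiency(X, y):
    """Maximal individual feature efficiency, sketched for two classes:
    per feature, the fraction of points outside the overlap region of
    the two class ranges; the best per-feature fraction is returned."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    c0, c1 = np.unique(y)
    A, B = X[y == c0], X[y == c1]
    lo = np.maximum(A.min(axis=0), B.min(axis=0))
    hi = np.minimum(A.max(axis=0), B.max(axis=0))
    outside = (X < lo) | (X > hi)   # separable using this feature alone
    return float(outside.mean(axis=0).max())
```

A value of 1 means some single feature separates the classes completely; a value near 0 means every feature's class ranges overlap almost everywhere.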

Measures of separability of classes

These measures illustrate to what extent two classes are separable by examining the existence and shape of the class boundary.

Minimized sum of error distance by linear programming

A small error value indicates that the data set is close to linearly separable.
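This can be sketched as a standard linear program (assuming SciPy is available; the function name is my own): minimize the sum of slack variables s_i subject to y_i (w · x_i + b) ≥ 1 − s_i and s_i ≥ 0, so a sum near zero means a separating hyperplane exists.

```python
import numpy as np
from scipy.optimize import linprog

def sum_error_distance(X, y):
    """Minimised average of error distances via linear programming.
    Variables are [w (d), b (1), s (n)]; we minimise sum(s) subject to
    y_i (w . x_i + b) >= 1 - s_i, rewritten as an A_ub x <= b_ub system."""
    X = np.asarray(X, dtype=float)
    y = np.where(np.asarray(y) == np.unique(y)[0], -1.0, 1.0)
    n, d = X.shape
    c = np.concatenate([np.zeros(d + 1), np.ones(n)])       # cost only on slacks
    A = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(None, None)] * (d + 1) + [(0, None)] * n     # w, b free; s >= 0
    res = linprog(c, A_ub=A, b_ub=b_ub, bounds=bounds)
    return float(res.fun) / n
```

On a linearly separable data set the optimum is (numerically) zero; on heavily overlapping classes the slacks, and hence the measure, are strictly positive.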

Error rate of linear classifier

Measures the linear separability of the original training set through the error rate of a linear classifier trained on it.
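As an illustration, here is a self-contained sketch that uses a simple perceptron as the linear classifier (my own choice; any linear model would do) and reports its training error rate:

```python
import numpy as np

def linear_error_rate(X, y, epochs=100, lr=0.1):
    """Training error rate of a perceptron, a stand-in for the 'error
    rate of a linear classifier' measure. An error rate near zero
    suggests the two classes are (close to) linearly separable."""
    X = np.asarray(X, dtype=float)
    y = np.where(np.asarray(y) == np.unique(y)[0], -1.0, 1.0)
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias term
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * (xi @ w) <= 0:              # misclassified -> update
                w += lr * yi * xi
    return float(np.mean(np.sign(Xb @ w) != y))
```

On two well-separated Gaussian blobs the perceptron converges and the training error drops to (essentially) zero.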

Fraction of points on class boundary

Counts the points that are incident to an edge of a minimum spanning tree connecting points of the two different classes.
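A minimal sketch of this measure, assuming SciPy is available (the function name is my own): build the minimum spanning tree of the full distance graph and count the points touched by a cross-class edge.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def boundary_fraction(X, y):
    """Fraction of points on the class boundary: points incident to a
    minimum-spanning-tree edge that links two different classes."""
    D = squareform(pdist(np.asarray(X, dtype=float)))
    mst = minimum_spanning_tree(D).toarray()
    i, j = np.nonzero(mst)
    y = np.asarray(y)
    cross = y[i] != y[j]                        # MST edges bridging the classes
    boundary = np.unique(np.concatenate([i[cross], j[cross]]))
    return len(boundary) / len(y)
```

For two compact, distant clusters the tree contains a single bridging edge, so only two points lie on the boundary and the fraction is small; interleaved classes push it toward 1.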

Measures of geometry, topology, and density of manifolds

These measures offer indirect characterizations of class separability. The shape, position, and inter-connectedness of these manifolds give hints on how well two classes are separated, but they do not describe separability by design.

Non-linearity of linear classifier

Given a training set, a test set is created by linear interpolation between randomly selected pairs of points from the same class; the measure is the error rate of a linear classifier on this interpolated test set.
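The construction of the interpolated test set can be sketched as follows (the function name is my own; the linear classifier that would then be evaluated on it is omitted):

```python
import numpy as np

def interpolated_test_set(X, y, n=200, seed=0):
    """Builds the synthetic test set used by the non-linearity measures:
    each new point is a random convex combination of two randomly chosen
    points of the same class, and it keeps that class's label."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    Xt, yt = [], []
    for _ in range(n):
        c = rng.choice(np.unique(y))            # pick a class at random
        idx = np.flatnonzero(y == c)
        a, b = X[rng.choice(idx, 2, replace=True)]
        t = rng.random()                        # interpolation coefficient
        Xt.append(t * a + (1 - t) * b)
        yt.append(c)
    return np.array(Xt), np.array(yt)
```

Because every synthetic point lies on a segment between two same-class points, a convex class region yields an easy test set, while a non-convex or fragmented class makes the interpolated points hard for a linear decision boundary.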

Non-linearity of 1 Nearest Neighbor classifier

This metric is measured through the error rate of the 1 Nearest Neighbor classifier.
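One common variant computes the leave-one-out error of 1-NN on the training set itself; a minimal sketch (function name my own):

```python
import numpy as np

def one_nn_error(X, y):
    """Leave-one-out error rate of the 1-nearest-neighbour classifier:
    each point is labelled by its closest *other* point; the fraction of
    mismatched labels estimates how intertwined the classes are."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)     # a point may not be its own neighbour
    return float(np.mean(y[D.argmin(axis=1)] != y))
```

Well-separated clusters give an error near 0; heavy class overlap pushes the error toward that of random guessing.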

Fraction of points with associated adherence subsets retained

Counts the number of maximal adherence subsets (balls centered on data points) needed to cover each class.
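A rough sketch of this idea, under the common hypersphere approximation (all names and the exact absorption rule are my own simplifications): each point grows a ball that stops at the nearest point of another class, balls fully nested inside a larger same-class ball are absorbed, and the surviving balls are counted.

```python
import numpy as np

def retained_adherence_fraction(X, y):
    """Fraction of retained adherence subsets under the hypersphere
    approximation: ball i (radius = distance to nearest other-class
    point) is absorbed when it fits inside a same-class ball j, i.e.
    dist(i, j) + r_i <= r_j; retained balls are counted and normalised."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    other = y[None, :] != y[:, None]
    radius = np.where(other, D, np.inf).min(axis=1)   # stop at the other class
    same = ~other
    np.fill_diagonal(same, False)
    absorbed = (D + radius[:, None] <= radius[None, :]) & same
    retained = ~absorbed.any(axis=1)
    return float(retained.mean())
```

Fewer retained balls per class (a value near 0) suggest each class forms a few large, coherent regions; a value near 1 means many small fragments are needed.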

What’s next ?

In part two, we will see:

  • How to use complexity measures to choose the appropriate classifier for a given data set.
  • The effect of data complexity on each classifier.
  • Which complexity metrics offer significant information and insight about the data.

Resources

  • Cano, J. R. (2013). Analysis of data complexity measures for classification. Expert systems with applications, 40(12), 4820–4831.
  • Jain, S., Shukla, S., & Wadhvani, R. (2018). Dynamic selection of normalization techniques using data complexity measures. Expert Systems with Applications, 106, 252–262.
  • Sotoca, J. M., Sanchez, J. S., & Mollineda, R. A. (2005). A review of data complexity measures and their applicability to pattern classification problems. Actas del III Taller Nacional de Mineria de Datos y Aprendizaje. TAMIDA, 77–83.
