*by Shaheen Gauher, PhD, Data Scientist at Microsoft*

At the heart of a classification model is the ability to assign a class to an object based on its description or features. When we build a classification model, often we have to prove that the model we built is significantly better than random guessing. How do we know if our machine learning model performs better than a classifier built by assigning labels or classes arbitrarily (through random guess, weighted guess etc.)? I will call the latter *non-machine learning classifiers* as these do not learn from the data. A machine learning classifier should be smarter and should not be making just lucky guesses! At the least, it should do a better job at detecting the difference between different classes and should have a better accuracy than the latter. In the sections below, I will show three different ways to build a non-machine learning classifier and compute their accuracy. The purpose is to establish some baseline metrics against which we can evaluate our classification model.

In the examples below, I will assume we are working with data with population size \(n\). The data is divided into two groups, with a fraction \(x\) of the rows or instances belonging to one class (labeled positive or \(P\)) and the remaining fraction \((1-x)\) belonging to another class (labeled negative or \(N\)). We will also assume that the majority of the data is labeled \(N\). This is the ground truth. (This is easily extended to data with more than two classes, as I show in the paper here.)

Population Size \(=n\)

Fraction of instances labelled positive \(=x\)

Fraction of instances labelled negative \(=(1-x)\)

Number of instances labelled positive \((P)\) \(=xn\)

Number of instances labelled negative \((N)\) \(=(1-x)n\)

The confusion matrix with rows reflecting the ground truth and columns reflecting the machine learning classifier classifications looks like:

## Non-machine learning classifiers

We can define some simple non-machine learning classifiers that assign labels based simply on the proportions found in the training data:

**Random Guess Classifier**: randomly label half of the instances \(P\) and the other half \(N\).

**Weighted Guess Classifier**: randomly label a fraction \(x\) of the instances \(P\) and the remaining fraction \((1-x)\) as \(N\).

**Majority Class Classifier**: label all of the instances \(N\) (the majority class in the data).
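As a minimal sketch of how these three classifiers assign labels, the Python below generates predictions against a synthetic ground truth and tallies a confusion matrix for each. The function names, the seed, and the values \(n = 10000\) and \(x = 0.24\) are illustrative assumptions, not taken from the experiment.

```python
import random


def random_guess(n, rng):
    """Label a random half of the n instances 'P' and the rest 'N'."""
    labels = ["P"] * (n // 2) + ["N"] * (n - n // 2)
    rng.shuffle(labels)
    return labels


def weighted_guess(n, x, rng):
    """Label a random fraction x of the n instances 'P' and the rest 'N'."""
    n_pos = round(x * n)
    labels = ["P"] * n_pos + ["N"] * (n - n_pos)
    rng.shuffle(labels)
    return labels


def majority_class(n):
    """Label every instance 'N' (assumed to be the majority class)."""
    return ["N"] * n


def confusion_counts(truth, predicted):
    """Return (TP, FN, FP, TN), treating 'P' as the positive class."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == "P" and p == "P")
    fn = sum(1 for t, p in pairs if t == "P" and p == "N")
    fp = sum(1 for t, p in pairs if t == "N" and p == "P")
    tn = sum(1 for t, p in pairs if t == "N" and p == "N")
    return tp, fn, fp, tn


# Synthetic ground truth: a fraction x of the n instances is positive.
n, x = 10_000, 0.24  # assumed values for illustration
n_pos = round(x * n)
truth = ["P"] * n_pos + ["N"] * (n - n_pos)

rng = random.Random(0)  # fixed seed so the run is repeatable
predictions = {
    "random guess": random_guess(n, rng),
    "weighted guess": weighted_guess(n, x, rng),
    "majority class": majority_class(n),
}

for name, predicted in predictions.items():
    tp, fn, fp, tn = confusion_counts(truth, predicted)
    accuracy = (tp + tn) / n
    print(f"{name:>14}: TP={tp} FN={fn} FP={fp} TN={tn} accuracy={accuracy:.3f}")
```

Because two of the classifiers are random, their counts vary from run to run around the expected values; only the majority class result is deterministic.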

The confusion matrices for these trivial classifiers would look like:

The standard performance metrics for evaluating classifiers are **accuracy**, **recall** and **precision**. (In a previous post, we included definitions of these metrics and how to compute them in R.) In this paper I algebraically derive the performance metrics for these non-machine learning classifiers. They are shown in the table below, and provide baseline metrics for comparing the performance of machine learning classifiers:

| Classifier | Accuracy | Recall | Precision |
| -------------- | ---------- | ------ | --------- |
| Random Guess | 0.5 | 0.5 | \(x\) |
| Weighted Guess | \(x^2 + (1-x)^2\) | \(x\) | \(x\) |
| Majority Class | \((1-x)\) | 0 | 0 |
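The closed-form baselines in the table can be wrapped in a small helper. This is a sketch only: the function name `baseline_metrics` and the example value \(x = 0.24\) (chosen so that the majority class makes up 76% of the data) are assumptions for illustration.

```python
def baseline_metrics(x):
    """Closed-form baseline metrics for a binary problem.

    x is the fraction of instances whose true label is positive (P);
    the negative class (N) is assumed to be the majority.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be a fraction between 0 and 1")
    return {
        "random_guess": {"accuracy": 0.5, "recall": 0.5, "precision": x},
        "weighted_guess": {
            "accuracy": x**2 + (1 - x) ** 2,
            "recall": x,
            "precision": x,
        },
        # No instance is ever predicted positive, so precision is strictly
        # 0/0; it is reported as 0 here, matching the table.
        "majority_class": {"accuracy": 1 - x, "recall": 0.0, "precision": 0.0},
    }


# Example with an assumed positive fraction of 24%.
for name, metrics in baseline_metrics(0.24).items():
    print(
        f"{name:>14}: accuracy={metrics['accuracy']:.4f} "
        f"recall={metrics['recall']:.4f} precision={metrics['precision']:.4f}"
    )
```

With \(x = 0.24\), the weighted guess accuracy works out to \(0.24^2 + 0.76^2 = 0.6352\) and the majority class accuracy to 0.76, so a trained model should beat both figures before it is considered useful.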

In this experiment at the Cortana Analytics Gallery you can follow a binary classification model built on the Census Income data set to see how the confusion matrices for the models compare with each other. An accuracy of 76% can be achieved simply by assigning all instances to the majority class.

Fig. Showing confusion matrices for a binary classification model using Boosted Decision Tree for Census Income data and models based on random guess, weighted guess and all instances assigned to majority class. The experiment can be found here.

For a **multiclass classification** with \(k\) classes, with \(x_i\) being the fraction of instances belonging to class \(i\) (\(i = 1, \dots, k\)), it can similarly be shown that:

**Random Guess**: Accuracy would be \(\frac{1}{k}\). The precision for a class \(i\) would be \(x_i\), the fraction of instances in the data with class \(i\). Recall for a class will be equal to \(\frac{1}{k}\). In the language of probability, the accuracy is simply the probability of selecting a class, which for two classes (binary classification) is 0.5 and for \(k\) classes will be \(\frac{1}{k}\).

**Weighted Guess**: Accuracy is equal to \(\sum_{i=1}^k x_{i}^2\). Precision and recall for a class are both equal to \(x_i\), the fraction of instances in the data with class \(i\). In the language of probability, for the binary case, \(\frac{xn}{n}\) or \(x\) is the probability of a label being positive in the data (for a negative label, the probability is \(\frac{(1-x)n}{n}\) or \((1-x)\)). If there are more negative instances in the data, the model has a higher probability of labeling an instance negative. The probability that the model produces a true positive is \(x \cdot x = x^2\) (for a true negative it is \((1-x)^2\)). The accuracy is simply the sum of these two probabilities.

**Majority Class**: Accuracy will be equal to \(\max_i x_i\), the fraction of instances belonging to the majority class (in the binary case, where the negative label is assumed to be the majority, this is \((1-x)\)). Recall will be 1 for the majority class and 0 for all other classes. Precision will be equal to the fraction of instances belonging to the majority class for the majority class and 0 for all other classes. In the language of probability, for the binary case, the probability of the model assigning a positive label to an instance will be zero and the probability of assigning a negative label will be 1.
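The multiclass baselines above can be sketched in the same way. The helper name `multiclass_baselines` and the example class fractions are assumptions made for illustration; the formulas follow the descriptions in this section, with per-class recall and precision returned as lists indexed by class.

```python
def multiclass_baselines(fractions):
    """Baseline metrics for k classes, where fractions[i] is the share x_i of class i."""
    k = len(fractions)
    if k == 0 or abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must be non-empty and sum to 1")
    # Index of the majority class (first one wins in case of a tie).
    majority = max(range(k), key=lambda i: fractions[i])
    return {
        "random_guess": {
            "accuracy": 1 / k,
            "recall": [1 / k] * k,
            "precision": list(fractions),
        },
        "weighted_guess": {
            "accuracy": sum(x_i**2 for x_i in fractions),
            "recall": list(fractions),
            "precision": list(fractions),
        },
        "majority_class": {
            "accuracy": fractions[majority],
            "recall": [1.0 if i == majority else 0.0 for i in range(k)],
            "precision": [
                fractions[majority] if i == majority else 0.0 for i in range(k)
            ],
        },
    }


# Example with three classes in assumed proportions 50% / 30% / 20%.
for name, metrics in multiclass_baselines([0.5, 0.3, 0.2]).items():
    print(f"{name:>14}: accuracy={metrics['accuracy']:.4f}")
```

For two classes with fractions \(x\) and \((1-x)\), the accuracies reduce to the binary values 0.5, \(x^2 + (1-x)^2\) and \((1-x)\).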

*We can use these simple, non-machine learning classifiers as benchmarks with which to compare the performance of machine learning models.*

**MRS code** to compute the baseline metrics for a classification model (Decision Forest) using the Census Income data set can be found below. The corresponding **R code** can be found here.

Acknowledgement: Special thanks to Danielle Dean, Said Bleik, Dmitry Pechyoni and George Iordanescu for their input.

Great article. How do I get a hard copy of it?

Posted by: John Fresen | March 25, 2016 at 02:32

Thanks John. You can find the link to the paper with the derivations in the article.

https://github.com/shaheeng/ClassificationModelEvaluation/blob/master/Baseline%20Metrics_Shaheen_article.pdf

Will be happy to answer additional questions if any.

Posted by: Shaheen Gauher | March 25, 2016 at 12:17