We can model a calibrated classifier as a random variable $X$ whose support is a subset of $[0, 1]$. Accordingly, if we denote its probability density function as $f(x)$ and its cumulative distribution function as $F(x)$, then, given a threshold $t$, we can define the expected confusion matrix as:

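| predicted cond / true cond | true | false |
|---|---|---|
| true | TP: $\int_t^1 x f(x)\,dx$ | FP: $(1 - F(t)) - \mathrm{TP}$ |
| false | FN: $\int_0^t x f(x)\,dx$ | TN: $F(t) - \mathrm{FN}$ |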

Given these values, we can compute any desired metric one might find in the Precision and recall Wikipedia article.
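As a minimal sketch of that step, the function below turns the four confusion matrix entries into accuracy, precision, recall, and F1. The function name and the example values are assumptions made for illustration: the values are the expected entries for a uniform score distribution on $[0, 1]$ thresholded at 0.5, worked out by hand.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return common metrics from the four confusion matrix entries.

    Works for raw counts or for normalized (expected) values, because
    every metric here is a ratio.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = float("nan")
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative values only: uniform scores on [0, 1], threshold 0.5.
example = classification_metrics(tp=0.375, fp=0.125, fn=0.125, tn=0.375)
print(example)
```

Undefined ratios (for example, precision when nothing is predicted positive) are returned as NaN here rather than 0; that is a design choice of this sketch.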


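Writing $F$ for the CDF of the scores and $t$ for the threshold, it follows directly from the definitions that:

$$\mathrm{TP} + \mathrm{FP} = 1 - F(t), \qquad \mathrm{FN} + \mathrm{TN} = F(t)$$

so the four entries sum to one.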

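Because the classifier is calibrated, an instance with score $x$ is truly positive with probability $x$. TP is therefore the expected value of the scores that exceed the threshold $t$, weighted by the score density $f(x)$:

$$\mathrm{TP} = \int_t^1 x f(x)\,dx$$

Note that the upper integration limit of one comes from the fact that the support of $X$ is bounded above by one, since calibrated classifiers output probabilities. FP is simply the rest of the probability mass above the threshold: $\mathrm{FP} = (1 - F(t)) - \mathrm{TP}$, where $F$ is the CDF.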

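FN is defined similarly to TP, since it is an estimate of the true values whose scores are less than the threshold $t$:

$$\mathrm{FN} = \int_0^t x f(x)\,dx$$

The only difference is the integration limits. Note that the lower integration limit of zero follows from the fact that calibrated classifiers output probabilities (which must be non-negative). TN is the rest of the probability mass below the threshold: $\mathrm{TN} = F(t) - \mathrm{FN}$, where $F$ is the CDF.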
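The integrals can also be evaluated numerically. The following is a rough sketch, not a definitive implementation: it assumes the density and CDF are available as Python callables, uses a simple midpoint rule, and the function name, the `steps` parameter, and the example distribution ($F(x) = x^2$, so $f(x) = 2x$) are choices made here for illustration.

```python
def expected_confusion_matrix(pdf, cdf, t, steps=10_000):
    """Expected confusion matrix for a calibrated classifier.

    pdf, cdf: callables describing the score distribution on [0, 1].
    t: decision threshold in [0, 1].
    """

    def integrate_x_pdf(a, b):
        # Midpoint rule for the integral of x * pdf(x) over [a, b].
        if b <= a:
            return 0.0
        h = (b - a) / steps
        total = 0.0
        for i in range(steps):
            x = a + (i + 0.5) * h
            total += x * pdf(x)
        return h * total

    tp = integrate_x_pdf(t, 1.0)   # expected positives above the threshold
    fn = integrate_x_pdf(0.0, t)   # expected positives below the threshold
    fp = (1.0 - cdf(t)) - tp       # remaining mass above the threshold
    tn = cdf(t) - fn               # remaining mass below the threshold
    return tp, fp, fn, tn


# Example distribution (an assumption for this sketch): F(x) = x^2, f(x) = 2x.
tp, fp, fn, tn = expected_confusion_matrix(
    pdf=lambda x: 2.0 * x, cdf=lambda x: x * x, t=0.5
)
print(tp, fp, fn, tn)
```

For this distribution the integrals have closed forms ($\mathrm{TP} = \tfrac{2}{3}(1 - t^3)$ and $\mathrm{FN} = \tfrac{2}{3}t^3$), which gives a convenient check on the numerical result.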

Metrics based on a sample of scores

Given a sample of scores generated by a calibrated classifier, we can use the confusion matrix formulae above by linearly scanning the scores and accumulating the sum and count of scores that exceed and do not exceed the threshold.

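| variable | definition |
|---|---|
| $s_{>}$ | sum of scores that exceed the threshold |
| $n_{>}$ | count of scores that exceed the threshold |
| $s_{\le}$ | sum of scores that do not exceed the threshold |
| $n_{\le}$ | count of scores that do not exceed the threshold |

The sample estimates of the confusion matrix entries are then $\mathrm{TP} = s_{>}$, $\mathrm{FP} = n_{>} - s_{>}$, $\mathrm{FN} = s_{\le}$, and $\mathrm{TN} = n_{\le} - s_{\le}$.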

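If desired, each of these can be divided by the sample size $n$ to ensure that:

$$\mathrm{TP} + \mathrm{FP} + \mathrm{FN} + \mathrm{TN} = 1$$

This makes TP and FN sample-wide true positive and false negative rates, that is, the estimated fractions of all instances that are true positives and false negatives. It also turns the two counts into empirical estimates of $1 - F(t)$ and $F(t)$, where $F$ is the CDF of the scores. The normalization has no effect on any metric computed as a ratio of combinations of TP, FP, FN, and TN, because the factor of $1/n$ cancels.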
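The linear scan described in this section can be sketched as follows. The function and variable names are assumptions made for this example, and the five scores are made-up illustrative data, not output from a real classifier.

```python
def sample_confusion_matrix(scores, t, normalize=False):
    """Estimate the confusion matrix from calibrated scores alone."""
    sum_above = 0.0   # sum of scores that exceed the threshold
    count_above = 0   # count of scores that exceed the threshold
    sum_below = 0.0   # sum of scores that do not exceed the threshold
    count_below = 0   # count of scores that do not exceed the threshold

    for p in scores:
        if p > t:
            sum_above += p
            count_above += 1
        else:
            sum_below += p
            count_below += 1

    tp = sum_above
    fp = count_above - sum_above
    fn = sum_below
    tn = count_below - sum_below

    if normalize:
        n = count_above + count_below
        if n == 0:
            raise ValueError("cannot normalize an empty sample")
        tp, fp, fn, tn = tp / n, fp / n, fn / n, tn / n
    return tp, fp, fn, tn


# Made-up scores for illustration.
scores = [0.9, 0.8, 0.6, 0.3, 0.1]
tp, fp, fn, tn = sample_confusion_matrix(scores, t=0.5)
print(tp, fp, fn, tn)  # roughly 2.3, 0.7, 0.4, 1.6 (up to float rounding)
```

Passing `normalize=True` divides every entry by the sample size, so the four values sum to one while ratio-based metrics such as precision are unchanged.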


To illustrate the ability to estimate classification metrics for a calibrated classifier, we devise the following series of experiments. Let $D$ be a probability distribution representing a calibrated classifier (i.e., with support in $[0, 1]$). Sample $n$ random variates from $D$ and, for each variate $p_i$, perform a weighted coin flip with positive probability $p_i$. These weighted coin flips are draws from $\mathrm{Bernoulli}(p_i)$ distributions. Define yp as the probabilities drawn from $D$ and yt as the associated Boolean values drawn from $\mathrm{Bernoulli}(p_i)$.
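A minimal sketch of this sampling step, using only Python's standard `random` module, is shown below. The helper name, the seed, the sample size, and the choice of Beta(2, 3) as the score distribution are assumptions for illustration.

```python
import random


def sample_scores_and_labels(draw_score, n, seed=0):
    """Draw n calibrated scores and one weighted coin flip per score.

    draw_score: callable taking a random.Random and returning a score in [0, 1].
    """
    rng = random.Random(seed)
    yp = [draw_score(rng) for _ in range(n)]
    # Each label is True with probability equal to its score.
    yt = [rng.random() < p for p in yp]
    return yt, yp


# Example: scores drawn from Beta(2, 3).
yt, yp = sample_scores_and_labels(lambda rng: rng.betavariate(2, 3), n=10_000)
print(sum(yt) / len(yt), sum(yp) / len(yp))
```

Because the labels are generated from the scores, the fraction of positive labels and the mean score should both be close to the distribution's mean (0.4 for Beta(2, 3)).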

To compute the (yt, yp) column in table 1, we use yt and yp and determine the metrics using a range of values in $[0, 1]$ as the threshold (i.e., the decision boundary). These metrics are computed using the *_score methods from scikit-learn’s sklearn.metrics module. The yp column in table 1 is computed using the methodology outlined in the Metrics based on a sample of scores section above. Finally, the dist column is determined by applying the desired metrics to the confusion matrix values computed with the formulae in the introduction. In table 1, this process is repeated for several distributions.
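To make the three calculation methods concrete, here is a small sketch that computes precision at a single threshold for U(0, 1) in each of the three ways. It is an illustration under stated assumptions, not the code behind the tables: the label-based function is a plain-Python stand-in for scikit-learn's `precision_score`, and the seed, sample size, threshold, and function names are hypothetical choices.

```python
import random


def precision_with_labels(yt, yp, t):
    # (yt, yp) method: threshold the scores and compare with the labels.
    tp = sum(1 for y, p in zip(yt, yp) if p > t and y)
    fp = sum(1 for y, p in zip(yt, yp) if p > t and not y)
    return tp / (tp + fp)


def precision_from_scores(yp, t):
    # yp method: TP is the sum of scores above t, TP + FP is their count.
    above = [p for p in yp if p > t]
    return sum(above) / len(above)


def precision_uniform(t):
    # dist method for U(0, 1): TP = (1 - t^2) / 2 and TP + FP = 1 - t,
    # so precision simplifies to (1 + t) / 2.
    return (1 + t) / 2


rng = random.Random(42)
yp = [rng.random() for _ in range(20_000)]   # scores from U(0, 1)
yt = [rng.random() < p for p in yp]          # one coin flip per score

t = 0.5
p_labels = precision_with_labels(yt, yp, t)
p_scores = precision_from_scores(yp, t)
p_dist = precision_uniform(t)
print(p_labels, p_scores, p_dist)
```

The three values should agree closely, with the label-based estimate typically showing the most sampling noise.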

Table 1: By distribution and calculation method

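| distribution | (yt, yp) | yp | dist |
|---|---|---|---|
| U(0, 1) | [plot] | [plot] | [plot] |
| $F(x) = x^2$ | [plot] | [plot] | [plot] |
| Beta(0.2, 0.3) | [plot] | [plot] | [plot] |
| Beta(2, 3) | [plot] | [plot] | [plot] |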

Table 2: By distribution and metric

| metric | U(0, 1) | $F(x) = x^2$ | Beta(0.2, 0.3) | Beta(2, 3) |
|---|---|---|---|---|
| accuracy | [plot] | [plot] | [plot] | [plot] |
| precision | [plot] | [plot] | [plot] | [plot] |
| recall | [plot] | [plot] | [plot] | [plot] |
| f1 | [plot] | [plot] | [plot] | [plot] |


When comparing the plots for each distribution in table 1, notice that the yp plots are smoother than the (yt, yp) plots and the dist plots are smoother than the yp plots. Metrics based only on yp can be thought of like the metrics in (yt, yp), except that infinitely many coin flips, rather than 1, are drawn for each score $p_i$ from its $\mathrm{Bernoulli}(p_i)$ distribution. The dist metrics can be thought of like the metrics based on yp as the sample size $n \to \infty$.

Also note that the metric values based on yp seem to approach the dist metric values. This convergence is consistent with the Glivenko–Cantelli theorem (1933), which states that the empirical CDF of a sample converges uniformly to the true CDF almost surely as the sample size grows.

One additional point: since the confusion matrix estimates can be determined at any threshold, curves like precision-recall curves and ROC curves can be determined parametrically as can metrics derived from these curves.
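As a sketch of this idea, the code below traces a ROC curve parametrically for U(0, 1) by sweeping the threshold and converting the expected confusion matrix into true and false positive rates, then approximates the area under the curve with the trapezoid rule. The closed-form entries for U(0, 1), the function names, and the threshold grid size are assumptions made for this example.

```python
def uniform_confusion_matrix(t):
    # Expected entries for scores distributed as U(0, 1):
    # f(x) = 1 and F(t) = t on [0, 1].
    tp = (1 - t * t) / 2
    fn = t * t / 2
    fp = (1 - t) - tp
    tn = t - fn
    return tp, fp, fn, tn


def roc_points(confusion_matrix, num_thresholds=1001):
    """Return (FPR, TPR) pairs for evenly spaced thresholds in [0, 1]."""
    points = []
    for i in range(num_thresholds):
        t = i / (num_thresholds - 1)
        tp, fp, fn, tn = confusion_matrix(t)
        tpr = tp / (tp + fn)
        fpr = fp / (fp + tn)
        points.append((fpr, tpr))
    return points


def auc_trapezoid(points):
    # Trapezoid rule over the curve, ordered by increasing FPR.
    pts = sorted(points)
    return sum(
        (x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )


points = roc_points(uniform_confusion_matrix)
auc = auc_trapezoid(points)
print(auc)
```

For U(0, 1) the curve has a closed form, $\mathrm{TPR} = 1 - t^2$ and $\mathrm{FPR} = (1 - t)^2$, so the exact area is $5/6$ and the numerical value can be checked against it.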


If you have reason to believe that a classifier is calibrated (e.g., it was explicitly calibrated), then classification metrics can be directly computed from the classifier’s scores without the need for ground truth data. While this may not be a perfect solution, it provides a good back-of-the-napkin estimate. If the distribution is known, the classification metrics can be computed analytically for a threshold $t$ from the distribution’s CDF and expectations over the intervals $[0, t]$ and $[t, 1]$.

Appendix: Code