2. Materials and Methods
Sanchez-Marquez et al. [8] derived an intrinsic kappa coefficient for dichotomous categories, with its point estimate expressed as
$$\hat{k} = 1 - \alpha - \beta$$
where $\hat{k}$ is the point estimate of the intrinsic kappa coefficient, $\alpha$ is the type-I error (the proportion of non-defective units misclassified), and $\beta$ is the type-II error (the proportion of defective units misclassified). Importantly, this coefficient does not depend on the percentage of instances of each category, which makes it a robust measure unaffected by class prevalence. Unfortunately, it can be applied only to dichotomous problems. This paper aims to develop a generalised intrinsic kappa coefficient applicable to problems with any number of categories.
In Figure 1, the confusion matrix is contained inside the bold square. The labels A, B, ..., and NC represent the categories used to classify each instance or unit. Every instance is assigned a label from one of these categories. The labelling process occurs twice. The first labelling is conducted beforehand and is known as the ‘known standard’. This initial label serves as a reference or benchmark to train and validate the classification system, whether it is a machine learning algorithm or a group of human raters. The classification system itself performs the second labelling. Comparing these two labels reveals the system’s performance. In this context, performance metrics such as the kappa coefficient quantify the system’s classification accuracy. In this matrix and in the formulae that follow, $m$ represents the number of times a particular object or instance is evaluated and classified. Therefore, $m$ is meaningful only in quality inspections; in machine learning, a given classifier assigns an instance to the same category every time it classifies it, so $m = 1$. The notation $P(i/j)$ represents the probability or proportion of instances or units classified as belonging to class $i$ when they belong to class $j$, with $i$ and $j$ ranging from A to NC. Similarly, $X_{i/j}$ indicates the number of instances or units classified in this manner. Note that correctly classified instances lie along the matrix diagonal, $i = j$. The number of units known to belong to class $i$ is denoted by $n_i$; finally, $N$ represents the total number of instances.
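The notation above can be made concrete with a minimal sketch that builds the counts $X_{i/j}$ and the conditional proportions $P(i/j)$ from two lists of labels. It assumes $m = 1$ and at least one unit per known category; the function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter

def confusion_counts(known, assigned, categories):
    """Return X[i][j]: units assigned to category i whose known standard is j."""
    pairs = Counter(zip(assigned, known))
    return [[pairs[(i, j)] for j in categories] for i in categories]

def conditional_proportions(X):
    """Return P[i][j] = X[i][j] / n_j, the proportion conditional on the column."""
    nc = len(X)
    # n_j: number of units known to belong to category j (column totals).
    n = [sum(X[i][j] for i in range(nc)) for j in range(nc)]
    return [[X[i][j] / n[j] for j in range(nc)] for i in range(nc)]

# Toy data (hypothetical): known standard versus labels given by the classifier.
categories = ["A", "B", "C"]
known    = ["A", "A", "A", "A", "B", "B", "C", "C", "C", "C"]
assigned = ["A", "A", "B", "A", "B", "B", "C", "A", "C", "C"]

X = confusion_counts(known, assigned, categories)
P = conditional_proportions(X)
for row in P:
    print(row)
```

Each column of `P` sums to one, which is the property the derivation of the intrinsic kappa coefficient relies on.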
Figure 1 shows the confusion matrix used to generalise the intrinsic kappa coefficient first developed by Sanchez-Marquez et al. [8]. For simplicity, NC denotes both the number of categories and the name of the last category. Following Sanchez-Marquez et al. [8], to derive an intrinsic kappa coefficient that does not depend on the proportion of units belonging to each category, we set up the sample as balanced and adopt the hypothesis that the resulting expression will not depend on the proportion of units in each category. This does not imply that the sample must be balanced in practice. Instead, the balanced set-up leads to a kappa expression that is independent of sample prevalence, which is the intrinsic kappa coefficient. Sanchez-Marquez et al. [8] used the same fundamental concepts and methodology when they first derived the intrinsic kappa coefficient for dichotomous problems. Once the expression for the intrinsic kappa point estimate is derived, we must check whether the initial hypothesis is met and then account for the different sample sizes in each category through kappa’s lower bound, computed from the F-statistic at the chosen significance level for the exact method or from the standard error for the approximate one. As explained by Sanchez-Marquez et al. [8], forcing the same number of instances in all categories yields a coefficient that does not depend on the prevalence: it is the value that a balanced experiment would obtain, which is the intrinsic value of kappa. The intrinsic kappa coefficient shows how well the system (or the classifier in machine learning) classifies the instances regardless of the proportion of units in each category. This is not the case with the traditional kappa coefficient, whose value changes when the proportion of units in each category changes, even if the classifier’s performance does not. Therefore, the expressions of the intrinsic kappa coefficient give the same result as the traditional kappa coefficient computed on a balanced sample, without requiring the sample to be balanced. Once the point estimate of the intrinsic kappa coefficient has been obtained, the estimation error due to the sample size must be accounted for through a confidence interval or limit.
In Figure 1, every cell shows a conditional proportion on the top and a count below. For example, the first cell of the matrix, in the upper left-hand corner, contains the proportion of instances known to belong to category A that are evaluated as belonging to A, that is, $P(A/A)$. It also contains the corresponding count, which, in this cell, is the number of instances evaluated as A when it is known they belong to A. Therefore, all cells contain the conditional proportion by column and the corresponding conditional count. The additional cells outside the confusion matrix (outside the bold square) also show the proportion on the top and the count below, except for the lower right-hand corner, which only contains the total count since the overall proportion is one. As mentioned above, the confusion matrix shown in Figure 1 is built under the hypothesis that assuming an equal number of instances in all categories ($n_A = n_B = \dots = n_{NC}$) will allow us to derive the intrinsic kappa coefficient.
According to Everitt [23], the kappa coefficient is defined as
$$\hat{k} = \frac{\hat{p}_o - \hat{p}_e}{1 - \hat{p}_e} \tag{1}$$
where $\hat{p}_o$ is the proportion of observed agreements, and $\hat{p}_e$ is the proportion of expected agreements (which can also be understood as agreements obtained by chance). The observed agreements, denoted as $X_o$, lie on the diagonal of the confusion matrix and represent the well-classified instances. The off-diagonal counts are the misclassified ones. This is the idea behind accuracy, which, along with kappa, is one of the most widely used performance metrics in machine learning. Readers interested in a deeper understanding can consult the foundational literature [23], which covers the definition and application of the kappa coefficient. The confusion matrix can be summarised using the proportion of well-classified instances. Therefore,
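$\hat{p}_o = X_o / N$ and $\hat{p}_e = X_e / N$, where $X_e$ denotes the number of agreements expected by chance.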
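Using the structure of the data from Figure 1,
$$\hat{p}_o = \frac{1}{NC} \sum_{i=A}^{NC} P(i/i) \tag{2}$$
Looking at Figure 1, we can see that the sum of all conditional proportions inside the confusion matrix is $NC$, since the conditional proportions in each column sum to one. Therefore,
$$\hat{p}_e = \frac{1}{NC^2} \sum_{i=A}^{NC} \sum_{j=A}^{NC} P(i/j) = \frac{1}{NC} \tag{3}$$
Substituting (3) in (1), we obtain
$$\hat{k} = \frac{\hat{p}_o - \frac{1}{NC}}{1 - \frac{1}{NC}} = \frac{NC\,\hat{p}_o - 1}{NC - 1} \tag{4}$$
The initial hypothesis is confirmed since, in Eq. (4), the point estimate of kappa does not depend on the prevalence. Therefore, Eq. (4), together with Eq. (2), gives the point estimate of the intrinsic kappa coefficient.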
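As a minimal sketch of Eqs. (2) and (4), the snippet below computes the intrinsic kappa point estimate from a matrix of counts (rows are assigned categories, columns are known categories) and checks that changing the prevalence of one category leaves the result unchanged. The function name and the counts are hypothetical and are not taken from the paper.

```python
def intrinsic_kappa(X):
    """Point estimate of the intrinsic kappa from counts X[i][j]
    (row i = assigned category, column j = known category)."""
    nc = len(X)
    col_totals = [sum(X[i][j] for i in range(nc)) for j in range(nc)]
    # Eq. (2): mean of the diagonal conditional proportions P(i/i).
    p_o = sum(X[i][i] / col_totals[i] for i in range(nc)) / nc
    # Eq. (4): chance agreement is 1/NC, whatever the prevalence.
    return (nc * p_o - 1) / (nc - 1)

# Hypothetical counts with 100 known units per category.
X = [[90,  5, 10],
     [ 6, 80, 10],
     [ 4, 15, 80]]

# Same conditional behaviour, but ten times more units of the first category.
X_skewed = [[row[0] * 10, row[1], row[2]] for row in X]

print(round(intrinsic_kappa(X), 6))         # 0.75
print(round(intrinsic_kappa(X_skewed), 6))  # 0.75
```

Because only column-conditional proportions enter the calculation, rescaling a column changes the prevalence but not the estimate.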
It is notable, and intuitive, that in the intrinsic kappa statistic the probability of classifying well by chance depends only on the number of categories.
Figure 2 shows the behaviour of $\hat{k}$ as a function of $NC$ for a typical accuracy level. It shows that the intrinsic kappa coefficient is penalised most when $NC$ is at its minimum ($NC = 2$), where $\hat{k}$ is also at its minimum. The larger the $NC$, the greater the $\hat{k}$. The behaviour shown in Figure 2 reflects that the larger the number of categories, the more difficult it is to classify well by chance, which is coherent with the definition of the kappa coefficient. In the traditional coefficient, by contrast, the probability of classifying well by chance also depends on the proportion of instances belonging to each category [8]. It is more coherent for the probability of classifying a particular unit or instance well by chance to depend only on the number of categories and not on the proportion of units in the sample belonging to each category. Therefore, the intrinsic kappa coefficient reflects how the system (human or automated) classifies better than the traditional coefficient does.
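The dependence on the number of categories can be tabulated with a short sketch based on Eq. (4). The accuracy level of 0.9 is an assumed value chosen for illustration; it is not the value used in Figure 2, and the function name is hypothetical.

```python
def intrinsic_kappa_from_accuracy(p_o, nc):
    """Eq. (4): intrinsic kappa for accuracy p_o and nc categories."""
    return (nc * p_o - 1) / (nc - 1)

p_o = 0.9  # assumed accuracy level, for illustration only
table = [(nc, intrinsic_kappa_from_accuracy(p_o, nc)) for nc in range(2, 11)]

for nc, k in table:
    print(f"NC = {nc:2d}   k = {k:.4f}")
```

For a fixed accuracy below one, the values increase with the number of categories and stay below the accuracy itself, since the chance-agreement term 1/NC shrinks as NC grows.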
It is well known that $k \le 1$; therefore, Eq. (4) must meet this essential characteristic. It means that when $\hat{p}_o = 1$, $\hat{k} = 1$.
We can derive the above conclusion directly from Eq. (4), which means that it applies to any number of categories:
$$\hat{k}\,\Big|_{\hat{p}_o = 1} = \frac{NC \cdot 1 - 1}{NC - 1} = 1$$
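Sanchez-Marquez et al. [8] derived the following equation for the case of two categories:
$$\hat{k} = 1 - \alpha - \beta \tag{5}$$
where $\alpha$ is the proportion of wrongly classified non-defective instances, and $\beta$ is the proportion of wrongly classified defective instances.
If Eq. (5) is a particular case of Eq. (4), it should follow from Eq. (4) when the latter is expressed as a function of $\alpha$ and $\beta$. Let us check it.
For two categories, expressing the accuracy in terms of $\alpha$ and $\beta$, we have that
$$\hat{p}_o = \frac{(1 - \alpha) + (1 - \beta)}{2} = 1 - \frac{\alpha + \beta}{2}$$
From Eq. (4), $\hat{k} = \frac{NC\,\hat{p}_o - 1}{NC - 1}$, so that for two categories, in terms of $\alpha$ and $\beta$, we arrive at
$$\hat{k} = 2\left(1 - \frac{\alpha + \beta}{2}\right) - 1 = 1 - \alpha - \beta$$
Therefore, we have confirmed that Eqs. (4) and (5) are equivalent for $NC = 2$.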
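The equivalence can also be checked numerically. The sketch below evaluates both expressions over a grid of error rates; the function names and the grid are hypothetical choices made for illustration.

```python
def kappa_eq4(p_o, nc):
    """General expression: (NC * p_o - 1) / (NC - 1)."""
    return (nc * p_o - 1) / (nc - 1)

def kappa_eq5(alpha, beta):
    """Dichotomous expression: 1 - alpha - beta."""
    return 1 - alpha - beta

max_gap = 0.0
for a in range(0, 50, 5):
    for b in range(0, 50, 5):
        alpha, beta = a / 100, b / 100
        # Accuracy of a balanced two-category sample with these error rates.
        p_o = ((1 - alpha) + (1 - beta)) / 2
        gap = abs(kappa_eq4(p_o, 2) - kappa_eq5(alpha, beta))
        max_gap = max(max_gap, gap)

print(max_gap)  # differences are at floating-point rounding level
```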
As noted in the literature [8], it is essential not to rely on the point estimate alone but to account for the sample size. The following lines derive expressions to estimate the confidence lower bound of the intrinsic kappa coefficient and of the accuracy using exact and approximate methods.
2.1. Exact Confidence Lower Bound of the Intrinsic Kappa Coefficient and Accuracy for Any Number of Categories
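As mentioned above, $\hat{p}_o$ is the proportion of well-classified instances; thus, it is a binomial statistic. Therefore, we can compute the exact confidence lower bound for $k$ based on the F distribution [26] [27]:
$$k_{LB} = \frac{NC\,p_{o,LB} - 1}{NC - 1} \tag{6}$$
$$p_{o,LB} = 1 - \frac{\nu_1 F_{\alpha;\nu_1,\nu_2}}{\nu_2 + \nu_1 F_{\alpha;\nu_1,\nu_2}} \tag{7}$$
where $k_{LB}$ is the confidence lower bound of the intrinsic kappa coefficient and $p_{o,LB}$ is the confidence lower bound of the accuracy.
- $\nu_1 = 2(X_w + 1)$.
- $\nu_2 = 2(N - X_w)$.
- $X_w$ is the number of wrong classifications.
- $N$ is the total number of instances.
- $F_{\alpha;\nu_1,\nu_2}$ is the value of the inverse F function for a significance level of $\alpha$ (the value exceeded with probability $\alpha$), and $\nu_1$ and $\nu_2$ degrees of freedom.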
It should be remarked that Eq. (7) computes the accuracy lower bound using the number of failures or wrongly classified instances [26], which are the instances outside the diagonal of the confusion matrix. Therefore, to account for the estimation error caused by the sample size, practitioners and researchers who prefer accuracy as a performance metric should report this lower bound instead of the point estimate, which has been the common practice so far.
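A minimal sketch of an exact one-sided lower bound is given below. Instead of looking up the inverse F function, it inverts the binomial tail numerically by bisection, which defines the same Clopper-Pearson bound; this substitution is made here only to keep the example within the Python standard library. The function names are hypothetical.

```python
import math

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p), with 0 < p < 1 (log-space terms)."""
    total = 0.0
    for k in range(x + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                   - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def accuracy_lower_bound(x_wrong, n, alpha=0.05):
    """Exact one-sided lower bound of the accuracy from the failure count."""
    if x_wrong >= n:
        return 0.0
    # Upper bound of the failure proportion: p such that P(X <= x_wrong) = alpha.
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if binom_cdf(x_wrong, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 1 - (lo + hi) / 2

def kappa_lower_bound(x_wrong, n, nc, alpha=0.05):
    """Lower bound of the intrinsic kappa: Eq. (4) applied to the accuracy bound."""
    p_lb = accuracy_lower_bound(x_wrong, n, alpha)
    return (nc * p_lb - 1) / (nc - 1)

# Hypothetical example: 10 wrong classifications out of 100, three categories.
print(accuracy_lower_bound(10, 100))
print(kappa_lower_bound(10, 100, 3))
```

With zero failures the bound reduces to $\alpha^{1/N}$, which gives a convenient sanity check of the bisection.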
To compute the lower bound of the intrinsic kappa for one category (category $i$), we must build a confusion matrix for two categories. One category is the one for which we want the kappa lower bound, and the other pools the ratings of all the remaining categories. Once we have constructed this two-way table, we apply the same concept as in Eqs. (6) and (7), but for two categories:
$$k_{i,LB} = 2\,p_{i,LB} - 1 \tag{8}$$
where:
$$p_{i,LB} = 1 - \frac{\nu_1 F_{\alpha;\nu_1,\nu_2}}{\nu_2 + \nu_1 F_{\alpha;\nu_1,\nu_2}} \tag{9}$$
where:
- $p_{i,LB}$ is the accuracy lower bound for the $i$ category.
- $\nu_1 = 2(X_{w,i} + 1)$.
- $\nu_2 = 2(N - X_{w,i})$.
- $X_{w,i} = N - X_{i/i} - X_{\bar{i}/\bar{i}}$ is the number of wrongly classified instances.
- $N$ is the total number of instances.
- $F_{\alpha;\nu_1,\nu_2}$ is the value of the inverse F function for a significance level of $\alpha$, and $\nu_1$ and $\nu_2$ degrees of freedom.
- $X_{i/i}$ is the number of well-classified instances that belong to the $i$ category.
- $X_{\bar{i}/\bar{i}}$ is the number of instances that do not belong to the $i$ category and are classified as not belonging to that category.
As with the overall performance, to compute the accuracy lower bound of one category, practitioners and researchers must build a two-way table as mentioned above and use Eq. (9) instead of the point estimate.
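The two-way table for a single category can be obtained by pooling all other categories, as in the following sketch. The function name and the counts are hypothetical; rows are assigned categories and columns are known categories.

```python
def collapse_to_two_way(X, i):
    """Collapse counts X[r][c] into category i versus all other categories.

    Returns (x_ii, x_rest, x_wrong, total):
      x_ii    - units of category i classified as i
      x_rest  - units not in i classified as not i
      x_wrong - off-diagonal units of the resulting 2x2 table
      total   - total number of units
    """
    nc = len(X)
    total = sum(sum(row) for row in X)
    x_ii = X[i][i]
    row_i = sum(X[i])                        # everything classified as i
    col_i = sum(X[r][i] for r in range(nc))  # everything known to be i
    x_rest = total - row_i - col_i + x_ii    # inclusion-exclusion
    x_wrong = total - x_ii - x_rest
    return x_ii, x_rest, x_wrong, total

# Hypothetical counts for three categories.
X = [[90,  5, 10],
     [ 6, 80, 10],
     [ 4, 15, 80]]

print(collapse_to_two_way(X, 0))  # (90, 185, 25, 300)
```

The failure count and the total returned here are the inputs that a per-category lower bound needs.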
Agresti & Coull [26] showed that approximate methods perform better than exact ones for binomial variables. It is worth deriving approximate methods for the estimation of confidence limits, not only for their precision but also for their simplicity [26], which allows practitioners and researchers to implement them using basic software packages such as Excel [27] [28]. Therefore, in the following lines, we derive expressions that approximate the exact Clopper-Pearson [29] lower bound of the intrinsic kappa coefficient, which will be tested in the results section. Since accuracy is a simple binomial variable, we can rely on Agresti & Coull’s results [26] and use their approximate expressions, which will also be derived in the next section.
2.2. Approximate Lower Bound of the Intrinsic Kappa Coefficient and Accuracy for Any Number of Categories
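To derive asymptotic approximate expressions for the confidence limits of any statistic, we must start by deriving the variance of its point estimate. Therefore, from Eq. (4):
$$\sigma_{\hat{k}}^2 = \left(\frac{NC}{NC - 1}\right)^2 \sigma_{\hat{p}_o}^2 = \left(\frac{NC}{NC - 1}\right)^2 \frac{p_o (1 - p_o)}{N} \tag{10}$$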
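According to Clopper & Pearson [29] and Agresti & Coull [26], from the statistic variance, we can construct the approximate confidence interval (CI) for $p$ by inverting the Wald test for $p$.
From the inverted hypothesis test:
$$H_0: p = p_0 \quad \text{versus} \quad H_1: p \neq p_0$$
that uses the z statistic
$$z = \frac{\hat{p} - p_0}{\sigma_{\hat{p}}},$$
we can derive the inverted confidence interval [26], which is:
$$\hat{p} - z_{\alpha/2}\,\sigma_{\hat{p}} \le p \le \hat{p} + z_{\alpha/2}\,\sigma_{\hat{p}}$$
It is well known that $\sigma_{\hat{p}} = \sqrt{p(1 - p)/N}$. Therefore,
$$\hat{p} - z_{\alpha/2} \sqrt{\frac{p(1 - p)}{N}} \le p \le \hat{p} + z_{\alpha/2} \sqrt{\frac{p(1 - p)}{N}}$$
However, since we do not usually have the population parameter $p$, the approximate CI is commonly calculated using an estimator, namely the point estimate $\hat{p}$. Therefore, the resulting Wald interval for $p$, which, according to Agresti & Coull [26], is one of the first parameter intervals ever derived, is:
$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{N}}$$
If we are interested in one bound, the expressions are:
$$p_{LB} = \hat{p} - z_{\alpha} \sqrt{\frac{\hat{p}(1 - \hat{p})}{N}}, \qquad p_{UB} = \hat{p} + z_{\alpha} \sqrt{\frac{\hat{p}(1 - \hat{p})}{N}}$$
for the lower and upper bound, respectively.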
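It is well known that, based on the central limit theorem (CLT), Wald’s hypothesis test and its derived interval have been generalised as a method to define normal approximations for the CIs of any statistical parameter. This generalisation can be expressed as:
$$\hat{\theta} - z_{\alpha/2}\,\sigma_{\hat{\theta}} \le \theta \le \hat{\theta} + z_{\alpha/2}\,\sigma_{\hat{\theta}}$$
where $\theta$ is the parameter of interest and $\hat{\theta}$ its estimator.
The expressions for one-bound estimations are:
$$\theta_{LB} = \hat{\theta} - z_{\alpha}\,\sigma_{\hat{\theta}}, \qquad \theta_{UB} = \hat{\theta} + z_{\alpha}\,\sigma_{\hat{\theta}}$$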
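Therefore, applying Wald’s generalisation for the intrinsic kappa coefficient from (15), we obtain
$$k_{LB} = \hat{k} - z_{\alpha}\,\sigma_{\hat{k}}$$
where $\sigma_{\hat{k}}$ is defined in Eq. (10), which leads us to
$$k_{LB} = \hat{k} - z_{\alpha}\,\frac{NC}{NC - 1} \sqrt{\frac{p_o (1 - p_o)}{N}}$$
As with the CI for $p$, $p_o$ is usually not known, so we need to use an estimator for $p_o$. As in Wald’s interval, the most obvious option is the point estimate of $p_o$; thus,
$$k_{LB} = \hat{k} - z_{\alpha}\,\frac{NC}{NC - 1} \sqrt{\frac{\hat{p}_o (1 - \hat{p}_o)}{N}} \tag{18}$$
where $z_{\alpha}$ is the value of the inverse standard normal distribution function for the $\alpha$ significance level; $\hat{k} = \frac{NC\,\hat{p}_o - 1}{NC - 1}$; and $N$ is the total sample size.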
Agresti & Coull [26] showed that the adjusted Wald method can improve on the results of the original Wald method by adding two failures and four instances to the point estimate. It means that $\hat{p}_o = 1 - \frac{X_w + 2}{N + 4}$ in Eq. (18), where $X_w$ is the number of wrongly classified instances.
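The approximate lower bound of the intrinsic kappa coefficient can be sketched as follows, with and without the adjustment of the point estimate. The function name and the example counts are hypothetical. In the adjusted case the sample size inside the square root is left at N, which follows the description above literally; Agresti & Coull’s own interval also uses N + 4 there, so this detail should be read as an assumption.

```python
from math import sqrt
from statistics import NormalDist

def kappa_lb_wald(x_wrong, n, nc, alpha=0.05, adjusted=False):
    """Approximate one-sided lower bound of the intrinsic kappa (Wald form)."""
    if adjusted:
        # Two extra failures and four extra instances in the point estimate.
        p_o = 1 - (x_wrong + 2) / (n + 4)
    else:
        p_o = 1 - x_wrong / n
    z = NormalDist().inv_cdf(1 - alpha)         # one-sided critical value
    k_hat = (nc * p_o - 1) / (nc - 1)           # Eq. (4)
    se = nc / (nc - 1) * sqrt(p_o * (1 - p_o) / n)  # square root of Eq. (10)
    return k_hat - z * se

# Hypothetical example: 10 wrong classifications out of 100, three categories.
plain = kappa_lb_wald(10, 100, 3)
adj = kappa_lb_wald(10, 100, 3, adjusted=True)
print(round(plain, 4), round(adj, 4))
```

In this example the adjusted bound is the more conservative of the two, because the added failures lower the point estimate.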
The following section will confirm that Eq. (18) approximates the value of $k_{LB}$ well for a wide range of sample sizes and accuracy rates. It will also compare the results of the adjusted and non-adjusted approximate lower bounds.
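Since the accuracy is a binomial statistic, based on Agresti and Coull’s results [26], we must apply the adjusted Wald approximate method for the proportion statistic. Therefore, for our purpose, we will have that
$$p_{o,LB} = \tilde{p}_o - z_{\alpha} \sqrt{\frac{\tilde{p}_o (1 - \tilde{p}_o)}{N + 4}}$$
where $\tilde{p}_o = 1 - \frac{X_w + 2}{N + 4}$ and $X_w$ is the number of wrongly classified instances. Notice that the point estimate must be adjusted by adding two counts to the smallest proportion, that is, to the number of failures or wrongly classified instances [26].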
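A minimal sketch of an adjusted Wald lower bound for the accuracy is shown below. It assumes the Agresti & Coull form, in which two failures and four instances are added and the adjusted sample size N + 4 is also used in the standard error; the function name and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def accuracy_lb_adjusted_wald(x_wrong, n, alpha=0.05):
    """One-sided adjusted Wald lower bound of the accuracy."""
    n_adj = n + 4                          # four extra instances
    p_adj = 1 - (x_wrong + 2) / n_adj      # two extra failures
    z = NormalDist().inv_cdf(1 - alpha)    # one-sided critical value
    return p_adj - z * sqrt(p_adj * (1 - p_adj) / n_adj)

# Hypothetical example: 10 wrong classifications out of 100 instances.
lb = accuracy_lb_adjusted_wald(10, 100)
print(round(lb, 4))
```

The bound lies below the point estimate of 0.9, and the gap narrows as the sample size grows.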