Evaluation Metrics for Multi-label Classification and Their Implementations

This article is the continuation of this post.
First, let’s learn what multi-label classification is. Multi-label classification simply means that a single sample in the training data can have more than one label.
Example: Let’s suppose a picture contains a ball, a pen, and a bat at the same time. The classification problem is to decide whether each of these 3 items is present in the input picture or not. This means that instead of 1 label (as in the previous types of classification), we have 3 labels for one data sample.
We can generate multi-label classification data for now with the help of the code below.
from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(sparse=False, n_samples=100, n_labels=4, allow_unlabeled=False)
Now, I’m gonna explain model building and the different approaches to it for multi-label classification, as it’s different from traditional model building in binary (or) multi-class classification. If you want to go straight to the metrics, you can skip this section.
There are 3 ways we can build a model for a multi-label classification problem:
- Problem Transformation
- Adapted Algorithm
- Ensemble Approaches
Let’s explore every way…
- Problem Transformation: This can be carried out in 3 ways
a. Binary relevance
b. Classifier Chains
c. Label powerset
Suppose we have input data X and label data [y1, y2, y3, y4].
A. Binary Relevance: In Binary Relevance, the multi-label problem is broken down into independent single-label (binary) classification problems, one per label.
To do this, pairs are formed like (X, y1), (X, y2), (X, y3), and (X, y4).
Now, a model is trained on every pair, i.e. on X and yi,
so 4 outputs will be generated.
from sklearn.datasets import make_multilabel_classification
from skmultilearn.problem_transform import BinaryRelevance, ClassifierChain, LabelPowerset
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

X, y = make_multilabel_classification(sparse=False, n_samples=100, n_labels=4, allow_unlabeled=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

tree_model = DecisionTreeClassifier()
meta_model = BinaryRelevance(tree_model)
meta_model.fit(X_train, y_train)
predictions = meta_model.predict(X_test)
print(metrics.accuracy_score(y_test, predictions))
B. Classifier Chains
Here, training is gonna look like this:

(input) -> (output)
X -> y1
X, y1 -> y2
X, y1, y2 -> y3
X, y1, y2, y3 -> y4

So, 4 outputs will be produced…
code snippet:
from sklearn.datasets import make_multilabel_classification
from skmultilearn.problem_transform import BinaryRelevance, ClassifierChain, LabelPowerset
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

X, y = make_multilabel_classification(sparse=False, n_samples=100, n_labels=4, allow_unlabeled=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

tree_model = DecisionTreeClassifier()
meta_model = ClassifierChain(tree_model)
meta_model.fit(X_train, y_train)
predictions = meta_model.predict(X_test)
print(metrics.accuracy_score(y_test, predictions))
C. Label Powerset: Here, every unique combination of labels that appears in the data gets assigned its own number, so each label set becomes a single class. For example,

in the above 6 data samples, as we can see, x1 and x4 have the same set of labels, and x3 and x6 have the same set of labels. So we can create a new column in the dataset, assign a number to each unique label set like below, and remove the original label columns.

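For instance, with illustrative label sets (the exact values don’t matter, only the repeats):

sample | label set    | new class
x1     | {y1, y3}     | 1
x2     | {y2}         | 2
x3     | {y1, y2, y4} | 3
x4     | {y1, y3}     | 1
x5     | {y3, y4}     | 4
x6     | {y1, y2, y4} | 3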
Now the problem has turned into multi-class classification, and we already know the traditional way of building a model for that. Right?
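Following the same pattern as the Binary Relevance and Classifier Chains snippets above, a minimal Label Powerset version of the code would look like this:

from sklearn.datasets import make_multilabel_classification
from skmultilearn.problem_transform import LabelPowerset
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

X, y = make_multilabel_classification(sparse=False, n_samples=100, n_labels=4, allow_unlabeled=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

tree_model = DecisionTreeClassifier()
# LabelPowerset maps each unique label combination to one class
meta_model = LabelPowerset(tree_model)
meta_model.fit(X_train, y_train)
predictions = meta_model.predict(X_test)
print(metrics.accuracy_score(y_test, predictions))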
Now, adapted algorithms are algorithms designed specifically for multi-label classification itself, such as ML-kNN. You can get the implementation code from the algorithms’ GitHub (I’m gonna provide the links).
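As a quick, minimal sketch (assuming scikit-multilearn is installed, and reusing the same synthetic data as above), ML-kNN can be used like this:

from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn import metrics
from skmultilearn.adapt import MLkNN

X, y = make_multilabel_classification(sparse=False, n_samples=100, n_labels=4, allow_unlabeled=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# MLkNN extends k-nearest neighbours with a Bayesian step that
# decides, per label, whether a test point should carry that label
classifier = MLkNN(k=3)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print(metrics.accuracy_score(y_test, predictions))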
Ensemble approaches are the same idea as traditional ensembling, so you can pick them up easily (I’m not going into detail, because this article focuses on metrics only). The links below are gonna definitely help you.
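For a taste of the ensemble route, here is a minimal sketch using scikit-multilearn’s RakelD, which partitions the label space into small subsets and trains a Label Powerset model per partition (shown only as a starting point, not a tuned model):

from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from skmultilearn.ensemble import RakelD

X, y = make_multilabel_classification(sparse=False, n_samples=100, n_labels=4, allow_unlabeled=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# RakelD = RAndom k-labELsets, Disjoint variant: each label subset
# becomes one Label Powerset sub-problem
meta_model = RakelD(base_classifier=DecisionTreeClassifier(), labelset_size=2)
meta_model.fit(X_train, y_train)
predictions = meta_model.predict(X_test)
print(metrics.accuracy_score(y_test, predictions))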
Links:
- http://scikit.ml/api/skmultilearn.html
- https://github.com/scikit-multilearn/scikit-multilearn/blob/master/skmultilearn/adapt/mlknn.py
Go through the GitHub implementations of scikit-multilearn, so that you’ll understand the code too.
Now, let’s jump into the metrics.
As I always say, we can measure performance in different ways: by using predefined, existing metrics, or by designing our own metrics according to our problem.
Here, I’m gonna give some existing metrics for Multi-label classification with code.
The Metrics we are gonna discuss are,
- Precision @ K.
- Average Precision @ K.
- Mean Average Precision @ K.
- Log Loss.
- Sampled F1-Score.
Precision @ K:
def patk(actual, pred, k):
    # no predictions are considered when k is 0
    if k == 0:
        return 0
    # keep only the top k predictions
    k_pred = pred[:k]
    actual_set = set(actual)
    pred_set = set(k_pred)
    # labels that appear in both the actual labels and the top-k predictions
    common_values = actual_set.intersection(pred_set)
    return len(common_values) / len(k_pred)

y_true = [1, 2, 0]
y_pred = [1, 1, 0]

if __name__ == "__main__":
    print(patk(y_true, y_pred, 3))
Don’t confuse the definition of precision we discussed earlier with the precision in precision @ K. In my previous blogs, precision was “out of all the values predicted as +ve, how many are actually +ve”. Here, precision @ K evaluates how correct the predictions are when we consider only the top K elements of the predicted label list for a particular sample (you’re gonna understand this with the example below).
In the above code, we chose K = 3, and the length of y_true and y_pred is also 3, so we are considering all the elements of the label list.
In precision @ K, we calculate precision for only one sample. Think of it like this: there is a photo which has a bat, a ball, and a pen, and we encode them as bat = 1, ball = 2, pen = 0.
So, in the above code, y_true = [1, 2, 0] and y_pred = [1, 1, 0]. This means the photo actually contains a bat, a ball, and a pen, but according to our model, it contains 2 bats and a pen.
As we can see, 2 out of the 3 predicted elements are correct ({1, 0} out of {1, 2, 0}), so 66.67% of the prediction is correct, and precision @ K is 0.6666….
Normally, precision @ K is used for a single sample.
Note: In practice, we use mean average precision @ K, as it summarizes performance over the whole dataset. Both precision @ K and average precision @ K lead to mean average precision @ K; I’m explaining them so that we know what the components of mean average precision are.
Average Precision @ K:
Now, here we are gonna work with N samples, N > 1.
Consider the following data:
y_true = [[1,2,0,1], [0,4], [3], [1,2]] # here we have 4 samples
y_pred = [[1,1,0,1], [1,4], [2], [1,3]]
Now, for each sample, we calculate precision @ i for every value of i (from 1 to K) and average all those precisions; the result is called Average Precision @ K.
import numpy as np

def patk(actual, pred, k):
    if k == 0:
        return 0
    k_pred = pred[:k]
    actual_set = set(actual)
    pred_set = set(k_pred)
    common_values = actual_set.intersection(pred_set)
    return len(common_values) / len(k_pred)

def apatk(actual, pred, k):
    # average the precision @ i values for i = 1..k
    precision_ = []
    for i in range(1, k + 1):
        precision_.append(patk(actual, pred, i))
    if len(precision_) == 0:
        return 0
    return np.mean(precision_)

y_true = [[1, 2, 0, 1], [0, 4], [3], [1, 2]]
y_pred = [[1, 1, 0, 1], [1, 4], [2], [1, 3]]

if __name__ == "__main__":
    for i in range(len(y_true)):
        for j in range(1, 4):
            print("for K = " + str(j) + ", average precision is " + str(apatk(y_true[i], y_pred[i], k=j)))
Mean Average Precision @ K:
Mean Average Precision @ K simply averages the average precision @ K over every sample, giving one number that reflects performance on the whole dataset.
import numpy as np

def patk(actual, pred, k):
    if k == 0:
        return 0
    k_pred = pred[:k]
    actual_set = set(actual)
    pred_set = set(k_pred)
    common_values = actual_set.intersection(pred_set)
    return len(common_values) / len(k_pred)

def apatk(actual, pred, k):
    precision_ = []
    for i in range(1, k + 1):
        precision_.append(patk(actual, pred, i))
    if len(precision_) == 0:
        return 0
    return np.mean(precision_)

def mapk(actual, pred, k):
    # creating a list for storing the average precision values
    average_precision = []
    # iterating through the whole data and calculating the apk for each sample
    for i in range(len(actual)):
        average_precision.append(apatk(actual[i], pred[i], k))
    # returning the mean over all the samples
    return np.mean(average_precision)

y_true = [[1, 2, 0, 1], [0, 4], [3], [1, 2]]
y_pred = [[1, 1, 0, 1], [1, 4], [2], [1, 3]]

if __name__ == "__main__":
    print(mapk(y_true, y_pred, 3))
Here, most of my code is from different resources, as the implementation is gonna be the same for existing metrics.
Log Loss:
Here, what we can do is binarize the labels of the whole dataset into separate binary columns, calculate the log loss for every column, and average the column losses.
This is also known as Mean Column-wise Log Loss, and we can implement it easily. If you want my implementation, go through my GitHub here.
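A minimal sketch of Mean Column-wise Log Loss, assuming we already have a binarized label matrix and per-label predicted probabilities (the values below are illustrative):

import numpy as np
from sklearn.metrics import log_loss

def mean_columnwise_log_loss(y_true, y_prob):
    # y_true: binary indicator matrix of shape (n_samples, n_labels)
    # y_prob: predicted probability of each label, same shape
    column_losses = [
        log_loss(y_true[:, i], y_prob[:, i], labels=[0, 1])
        for i in range(y_true.shape[1])
    ]
    # average the per-column log losses
    return np.mean(column_losses)

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3], [0.8, 0.6, 0.4]])
print(mean_columnwise_log_loss(y_true, y_prob))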
Sampled F1-Score:
The sampled F1-score follows the same idea as the log loss above: binarize the labels into columns, then apply sklearn’s f1_score to the binarized matrices with average="samples", which computes the F1 for each sample and then averages over samples.
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

def f1_sampled(actual, pred):
    # fit the binarizer on all label sets so both matrices share the same columns
    mlb = MultiLabelBinarizer()
    mlb.fit(actual + pred)
    actual = mlb.transform(actual)
    pred = mlb.transform(pred)
    # "samples" computes the F1 per sample, then averages over samples
    f1 = f1_score(actual, pred, average="samples")
    return f1

y_true = [[1, 2, 0, 1], [0, 4], [3], [1, 2]]
y_pred = [[1, 1, 0, 1], [1, 4], [2], [1, 3]]

if __name__ == "__main__":
    print(f1_sampled(y_true, y_pred))
Now, you can confidently say, “I know multi-label classification metrics😊”
Conclusion:
- Among Precision @ K, Average Precision @ K, and Mean Average Precision @ K, use Mean Average Precision @ K when you want a single score for the whole dataset; apply the remaining metrics according to the requirement.
- With this article, classification metrics are complete. My next topic will depend on what readers want.
___________________________________________________________________
You guys can ping me your fav topic on LinkedIn, and follow my GitHub here.
Follow me on Medium:- https://iamvishnu-varapally.medium.com/
Happy Learning✌!!