Evaluation Methods in Natural Language Processing (NLP): Part 1
Evaluation in NLP (Natural Language Processing) refers to the process of assessing the quality and performance of NLP models. It involves measuring how well a model completes a specific NLP task, such as text classification, sentiment analysis, machine translation, or question answering. Evaluation is typically performed using metrics that reflect the accuracy or effectiveness of the model, and these metrics vary depending on the task and the specific goals of the evaluation. For example, accuracy, precision, recall, and F1-score are common metrics for evaluating text classification and information retrieval models; BLEU and ROUGE are used in machine translation evaluation; and perplexity and WER are used for automatic speech recognition and text generation.
Let’s take a deep dive into accuracy, precision, recall, F1-score and the BLEU score in this part.
Accuracy, precision, recall and F1-score
These metrics are especially useful in classification tasks such as sentiment analysis, named entity recognition, and text classification. All of them are derived from TP, TN, FP and FN: TP (True Positive) is the number of positive instances correctly identified as positive, TN (True Negative) is the number of negative instances correctly identified as negative, FP (False Positive) is the number of negative instances incorrectly identified as positive, and FN (False Negative) is the number of positive instances incorrectly identified as negative.
Accuracy: Accuracy is the proportion of correctly classified instances out of the total number of instances. In NLP tasks, accuracy is often used to measure the overall performance of a model.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision: Precision is the proportion of true positives (correctly identified instances) to the total number of instances identified as positive. In NLP tasks, precision is used to measure how many of the instances identified as positive are actually positive.
Precision = TP / (TP + FP)
Recall: Recall is the proportion of true positives to the total number of instances that are actually positive. In NLP tasks, recall is used to measure how many of the positive instances were correctly identified by the model.
Recall = TP / (TP + FN)
F1-score: F1-score is the harmonic mean of precision and recall, and is used to balance the trade-off between precision and recall. It ranges from 0 to 1, with 1 being the best possible score. In NLP tasks, F1-score is often used to evaluate the overall performance of a model.
F1-score = 2 * (Precision * Recall) / (Precision + Recall)
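As a quick worked illustration, here is a minimal Python sketch with made-up counts for a hypothetical binary sentiment classifier, showing how the four formulas translate into code:

```python
# Made-up confusion-matrix counts for a hypothetical binary sentiment classifier
TP, TN, FP, FN = 40, 45, 5, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)                # 0.85
precision = TP / (TP + FP)                                 # ~0.889
recall    = TP / (TP + FN)                                 # 0.80
f1        = 2 * precision * recall / (precision + recall)  # ~0.842

print(f"Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, "
      f"Recall: {recall:.3f}, F1-score: {f1:.3f}")
```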
Let’s implement a text classification model and compute these evaluation metrics for it. For this we are going to use the dataset from here.
While running sklearn.metrics.classification_report, we get the following classification report:
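Since the original dataset is only linked above, the sketch below uses scikit-learn’s built-in 20 Newsgroups data as a stand-in, and the TF-IDF plus logistic regression pipeline is an assumed setup rather than the exact model used here; it simply shows how such a report can be produced:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Two-class subset of 20 Newsgroups as a stand-in for the article's dataset
categories = ["rec.sport.baseball", "sci.space"]
train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

# TF-IDF features + logistic regression (an assumed, simple pipeline)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train.target)
y_pred = clf.predict(X_test)

# Per-class precision, recall, F1 plus the macro/weighted averages and accuracy
print(classification_report(test.target, y_pred, target_names=categories))

# Confusion matrix for the same predictions (used in the next step)
print(confusion_matrix(test.target, y_pred))
```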
We can see the recall, precision and F1-score per class, along with the averages. Instead, let us look at the confusion matrix for a holistic understanding of all the classes we have taken as the Y label.
The confusion matrix allows us to compute the values of True Positive (TP), False Positive (FP), and False Negative (FN), as shown below. Let’s also calculate precision, recall and F1-score for each class using the formulas mentioned above.
With multiple per-class F1 scores, averaging them gives a better picture of the model’s overall performance. This is particularly useful when we are dealing with class imbalance or a task such as named entity recognition.
Let’s take the macro, weighted and micro averages of both classes.
The macro average is computed as the arithmetic mean of all per-class F1 scores, treating every class equally regardless of its support.
The weighted average is the mean of the per-class F1 scores weighted by each class’s support, where support is the number of samples of that class used when computing the classification metrics.
The Micro average computes a global average by counting the sums of the True Positives (TP), False Negatives (FN), and False Positives (FP).
Here, a natural question is why no precision and recall values are shown for the micro average, and only accuracy is given. The reason is that the micro average essentially computes the proportion of correctly classified observations out of all observations; if we were to compute micro-averaged precision and recall, we would get exactly the same value.
Based on this, we can conclude that in the case of class imbalance the macro and weighted averages are more informative, while for a balanced dataset the overall accuracy is fine for evaluating model performance.
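As a minimal sketch (with made-up labels rather than the dataset above), all three averages can be obtained directly from scikit-learn:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up true and predicted labels for a small two-class example
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

for avg in ("macro", "weighted", "micro"):
    p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
    print(f"{avg:>8}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")

# Micro-averaged precision, recall and F1 all equal the overall accuracy
print("accuracy:", accuracy_score(y_true, y_pred))
```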
BLEU (Bilingual Evaluation Understudy)
The BLEU score was proposed by Kishore Papineni, et al. in their 2002 paper “BLEU: a Method for Automatic Evaluation of Machine Translation”.
BLEU (Bilingual Evaluation Understudy) score is a metric used to evaluate the quality of machine-generated text or translations in Natural Language Processing (NLP). It is a statistical measure that calculates the degree of similarity between a machine-generated sentence and one or more reference sentences. The BLEU score ranges from 0 to 1, with 1 indicating perfect similarity between the machine-generated text and the reference text.
BLEU score is also used in other NLP tasks such as text summarization and image captioning. However, it is important to note that the BLEU score is not always an accurate measure of the quality of machine-generated text, as it has certain limitations and can produce misleading results in some cases.
The BLEU score lies between 0 and 1, and a score above 0.6 is generally considered very good; even human translators rarely achieve a perfect match. For this reason, a score very close to 1 is unrealistic in practice, and a BLEU score of 1 usually indicates that the model is overfitting.
Before we get into how the BLEU score is calculated, we need to understand n-grams and precision.
N-gram
An n-gram is a basic concept used in NLP for text processing: it is simply a sequence of n consecutive words taken in order.
For instance, in the sentence “This book is informative”, we could have n-grams such as:
- 1-gram (unigram): “This”, “book”, “is”, “informative”
- 2-gram (bigram): “This book”, “book is”, “is informative”
- 3-gram (trigram): “This book is”, “book is informative”
- 4-gram: “This book is informative”
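A minimal sketch of how these n-grams can be extracted in plain Python:

```python
def ngrams(sentence, n):
    """Return all n-grams of a whitespace-tokenized sentence as strings."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "This book is informative"
for n in range(1, 5):
    print(f"{n}-gram:", ngrams(sentence, n))
# 1-gram: ['This', 'book', 'is', 'informative']
# 2-gram: ['This book', 'book is', 'is informative'] ... and so on
```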
Precision
We have already discussed this metric above, but in the context of BLEU it works as follows.
For example, we have:
- Target Sentence: This book is informative
- Predicted Sentence: That book is informative
Precision = Number of correct predicted words / Number of total predicted words
Precision = 3 / 4
But here we need to handle it differently. Precision works well when there is a well-defined class to predict, but with generated text a prediction can simply repeat a correct word many times and still score highly. We handle this with a method called clipped precision.
Clipped Precision
Let’s take an example to understand how it works.
- Target Sentence: This book is very informative
- Predicted Sentence: This book book is is informative
Now we compare each predicted word with the target sentence; if the word appears there, we count it as correct. However, we limit the count for each correct word to the maximum number of times that word occurs in the target sentence, which prevents the model from being rewarded for repeating words.
Here, the word “book” occurs only once in the target sentence, so we clip its count to one even though it occurs twice in the prediction.
Clipped Precision = Clipped number of correct predicted words / Number of total predicted words
Clipped Precision = 4 / 6
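The same calculation can be reproduced with a small sketch based on collections.Counter:

```python
from collections import Counter

target = "This book is very informative".split()
predicted = "This book book is is informative".split()

target_counts = Counter(target)
predicted_counts = Counter(predicted)

# Each predicted word counts at most as often as it appears in the target
clipped_matches = sum(min(count, target_counts[word])
                      for word, count in predicted_counts.items())

print(f"Clipped Precision = {clipped_matches}/{len(predicted)}")  # 4/6
```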
Let’s calculate BLEU score.
The mathematical representation of the BLEU score is:

BLEU = BP * exp( Σ w_n * log(p_n) ), for n = 1 to N

Where,
- p_n is the clipped precision for n-grams of size n,
- w_n is the weight given to each n-gram size (typically 1/N),
- BP is the Brevity Penalty, explained further below.
Let’s take one example and calculate the n-gram precision scores.
- Target Sentence: The cat is on the mat
- Predicted Sentence: The cat is on the table.
As per the formula mentioned above, let’s calculate the n-gram precisions:
1-gram precision = 5/6
2-gram precision = 4/5
3-gram precision = 3/4
4-gram precision= 2/3
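These four values can be reproduced by extending the clipped-precision idea to n-grams; the sketch below assumes a simple lowercased, punctuation-stripped tokenization:

```python
from collections import Counter

def tokenize(sentence):
    # Simplifying assumption: lowercase and strip basic punctuation
    return [word.strip(".,").lower() for word in sentence.split()]

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_ngram_precision(target, predicted, n):
    ref = ngram_counts(tokenize(target), n)
    hyp = ngram_counts(tokenize(predicted), n)
    matches = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matches, sum(hyp.values())

target = "The cat is on the mat"
predicted = "The cat is on the table."
for n in range(1, 5):
    matches, total = clipped_ngram_precision(target, predicted, n)
    print(f"{n}-gram precision = {matches}/{total}")
# Prints 5/6, 4/5, 3/4 and 2/3, matching the values above
```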
Now, we combine these scores using the Geometric Average Precision(N) formula, i.e. the weighted geometric mean of the n-gram precisions up to order N. Different values of N can be used, with different weights w_n for each n-gram size.
Let’s dive into the next component, the Brevity Penalty.
Brevity Penalty
As we saw, for individual words like “the” and “cat” the precision is 1/1, which is misleading because it encourages the model to output only a few high-confidence words and still receive a high score. To avoid this, the Brevity Penalty penalizes predicted sentences that are too short.
The Brevity Penalty is defined as BP = 1 if c > r, and BP = exp(1 - r/c) otherwise, where c is the length of the predicted sentence and r is the length of the target sentence. Based on this formula, if we predict very few words the Brevity Penalty becomes small, and it can never be larger than 1 even if the predicted sentence is longer than the target.
Finally, BLEU Score is calculated by multiplying the Brevity Penalty with the Geometric Average of the Precision Scores.
There are different implementations of the BLEU score available on the internet, but the mathematics behind all of them is the formula given above.
Finally, the BLEU score is most commonly calculated with N = 4, taking the geometric average of the unigram, bigram, trigram and 4-gram precisions (often called BLEU-4).
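NLTK ships a ready-made implementation; a minimal sketch for the example above, using the default BLEU-4 weights of 0.25 per n-gram order and assuming NLTK is installed:

```python
from nltk.translate.bleu_score import sentence_bleu

reference = "the cat is on the mat".split()    # target sentence, tokenized
candidate = "the cat is on the table".split()  # predicted sentence, tokenized

# Default weights (0.25, 0.25, 0.25, 0.25) correspond to BLEU-4
score = sentence_bleu([reference], candidate)
print(round(score, 4))
# BP = 1 here (equal lengths), so the score is
# (5/6 * 4/5 * 3/4 * 2/3) ** 0.25 ≈ 0.7598
```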
Cons:
Despite its popularity, the BLEU score has been criticized for its weaknesses.
- BLEU Score does not consider the meaning of words, which can lead to incorrect assessments.
- BLEU Score only recognizes exact word matches and does not consider variants of the same word.
- BLEU Score does not distinguish between the importance of words and penalizes incorrect words equally, regardless of their relevance to the sentence.
- BLEU Score does not take into account the order of words, which can result in different sentences with the same words receiving the same (unigram) BLEU score.
We will take a look at the ROUGE score, WER and perplexity in another article.
Suggestions are always welcome.
Reference:
https://machinelearningmastery.com/calculate-bleu-score-for-text-python/