Handling Noisy Labels in Text Classification
One of the most common bottlenecks in building Supervised Machine Learning systems for real-world problems is the availability of high-quality labelled data. With recent advancements such as Large Language Models, people have started exploring pre-trained models and embeddings in a plug-and-play fashion to address the data gap. However, in real-world scenarios, we often encounter problem statements that depend on domain-specific data with noisy labels or no labels at all. While manual annotation may be an option, the process can end up being quite expensive. Manually labelled data is also susceptible to noise and errors due to factors such as annotation ambiguity, human subjectivity and fatigue, and thus needs further curation. Such situations can demand fine-tuning cost-effective models using the available resources.
For over a decade now, I have been primarily focused on Natural Language Understanding problems and, on numerous occasions, have ended up exploring multiple strategies to handle data with noisy labels, ranging from designing robust loss functions to weaker forms of supervision (weak supervision, semi-supervision, etc.). If I were to revisit some of those problems today, I would prefer having access to a common framework that runs quality checks on the input data, thereby decoupling data curation from the downstream models that consume the high-quality data across use cases. Let's call this framework "Data Quality Checks", or DQC for short. While there have been initiatives around DQC, such efforts either did not seem to evolve into actively maintained general frameworks or come with stringent license requirements for commercial use.
In this article, I want to walk through how you could build your own DQC for text classification data with noisy labels. Using a publicly available benchmark dataset, we will learn how to identify samples with reliable labels from a pool of samples with potentially noisy labels. Towards the end of the article, I'll also share details about DQC Toolkit, an open source library I'm building along similar lines to curate such noisy data.
Let’s get started.
Dataset
For the purpose of demonstration, we are going to use the popular text classification benchmark dataset AG News. We will load it with Hugging Face's datasets package.
from datasets import load_dataset
import pandas as pd
dataset = 'ag_news'
dset = load_dataset(dataset)
dset
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 120000
})
test: Dataset({
features: ['text', 'label'],
num_rows: 7600
})
})
Let's create the train_data and val_data dataframes.
train_data = pd.DataFrame(dset['train'])
val_data = pd.DataFrame(dset['test'])
print(f'Number of training samples: {len(train_data)}')
print(f'Number of validation samples: {len(val_data)}')
train_data.head()
Number of training samples: 120000
Number of validation samples: 7600
|   | text | label |
|---|------|-------|
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | 2 |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | 2 |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | 2 |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | 2 |
| 4 | Oil prices soar to all-time record, posing new... | 2 |
Let's look at the label distributions of train_data and val_data to ensure that they are similar.
train_data['label'].value_counts(1)
label
2 0.25
3 0.25
1 0.25
0 0.25
Name: proportion, dtype: float64
val_data['label'].value_counts(1)
label
2 0.25
3 0.25
1 0.25
0 0.25
Name: proportion, dtype: float64
The label distributions match. We can stick to the default train-validation splits for now.
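For reference, the integer labels correspond to named classes, and the mapping can be read off the dataset's ClassLabel feature. The quick check below is a minimal sketch using the datasets API; for AG News, the classes should come out as World, Sports, Business and Sci/Tech, which is handy to keep in mind when we read the confusion matrices later.
# Map integer labels to their class names
label_names = dset['train'].features['label'].names
print(dict(enumerate(label_names)))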
Simulate noisy labels
Since the goal of the article is to study the impact of noisy labels in the data, let's mislabel some of the samples in our training data in a controlled setting to be able to benchmark our DQC performance.
import numpy as np
import pandas as pd
from pandas._typing import RandomState
from typing import Union, Tuple
def add_asymmetric_noise(
labels: pd.Series,
noise_prob: float,
random_state: Union[RandomState, None] = 42,
) -> Tuple[pd.Series, float]:
"""
Util function to add asymmetric noise to labels
for simulation of noisy label scenarios.
Args:
labels (pd.Series): Input pandas series with integer values
ranging from 0 to n - 1.
noise_prob (float): Probability of adding noise to each value.
random_state (Union[RandomState, None]): Random seed for reproducibility
Returns:
pd.Series: Series with asymmetric noise added to it.
float: Normalized quantification of pairwise disagreement between `labels` and `noisy_labels` for parity check
"""
# Set seed
np.random.seed(random_state)
# Avoid modifying the original data
noisy_labels = labels.copy()
# Build a replacement dictionary
unique_labels = list(set(noisy_labels))
replacement_dict = {label: [candidate for candidate in unique_labels
if candidate != label]
for label in unique_labels}
# Determine the number of samples to modify based on the noise probability
num_samples = min(len(noisy_labels), int(len(noisy_labels) * noise_prob + 1))
# Sample random indices from the labels to introduce noise
target_indices = np.random.choice(len(noisy_labels), num_samples, replace=False)
for idx in target_indices:
# Introduce noise
noisy_labels[idx] = np.random.choice(replacement_dict[noisy_labels[idx]])
# Parity check
num_mismatches = sum([label != noisy_label for label, noisy_label
in zip(labels.values, noisy_labels.values)])
observed_noise_ratio = num_mismatches / len(noisy_labels)
return noisy_labels, observed_noise_ratio
We've created a function add_asymmetric_noise that takes our input labels and returns them after adding noise with probability noise_prob. The returned observed_noise_ratio indicates whether the final noise ratio in the labels matches noise_prob. Let's test it out.
noisy_labels, observed_noise_ratio = add_asymmetric_noise(train_data['label'],
noise_prob=0.5)
observed_noise_ratio
0.5000083333333334
We've assumed that we'd like 50% of the data labels to be noisy, so noise_prob is set to 0.5. observed_noise_ratio is also close to 0.5, so we are good to go. Now, let's create a new column 'noisy_label' in train_data.
train_data['noisy_label'] = noisy_labels
train_data.head()
|   | text | label | noisy_label |
|---|------|-------|-------------|
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | 2 | 0 |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | 2 | 3 |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | 2 | 2 |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | 2 | 0 |
| 4 | Oil prices soar to all-time record, posing new... | 2 | 0 |
Out of curiosity, let's check what the label distribution looks like
train_data['noisy_label'].value_counts(1)
noisy_label
0 0.251325
2 0.250467
1 0.249333
3 0.248875
Name: proportion, dtype: float64
Yikes! The distribution is extremely similar to the original label distribution even though 50% of the labels are noisy, so the data appears error-free at face value.
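One quick way to see where the noise actually went is to look at the transition matrix between the original and noisy labels rather than the marginals. The snippet below is a small sketch using only pandas on the columns we just created.
# Row-normalized transition matrix: the fraction of each true label that
# ended up as each noisy label. With ~50% asymmetric noise, the diagonal
# sits near 0.5 even though the marginal distribution stays balanced.
pd.crosstab(train_data['label'], train_data['noisy_label'], normalize='index').round(3)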
Pre-processing
Let's define a function that can be used to preprocess the text to be consumed in our Machine Learning pipeline. Pre-processing based on our current dataset would include converting text to lowercase, retaining only alphanumeric characters and removing stopwords.
import re
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
# Note: the 'punkt' tokenizer and 'stopwords' corpus may need a one-time download:
# nltk.download('punkt'); nltk.download('stopwords')
def text_preprocessor(text):
# Function to perform text cleaning and pre-processing
# Convert text to lowercase
text = text.lower()
# Remove special characters, punctuation, and numbers using regex
text = re.sub(r"[^a-z0-9]", " ", text)
# Tokenize text
tokens = word_tokenize(text)
# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token not in stop_words]
# Join tokens back into text
preprocessed_text = ' '.join(filtered_tokens)
return preprocessed_text
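As a quick sanity check before transforming the full columns, the function can be applied to a single made-up headline (a minimal sketch; the example sentence and its output are illustrative only).
# Hypothetical example to illustrate the transformation:
# lowercased, non-alphanumeric characters stripped, stopwords removed
sample = "Oil prices soar to all-time record, posing new menace to US economy!"
print(text_preprocessor(sample))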
train_data['text'] = train_data['text'].apply(text_preprocessor)
val_data['text'] = val_data['text'].apply(text_preprocessor)
Supervised Learning
Now that we have the data in place, let's build a baseline model. In this article, we will use Sklearn's TfidfVectorizer combined with Logistic Regression as our classification pipeline.
Note: Although pre-trained embeddings can outperform TF-IDF without needing much text preprocessing, we stick to TF-IDF for this demo because we'd like to avoid potential leakage (the embeddings could have been exposed to publicly available datasets like AG News during training), which could lead us to draw incorrect conclusions about the impact of noisy labels.
Baselines
Without DQC
Our baseline is a model trained on all the available noisy data without any quality checks. For evaluation, we use the weighted F1 score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
pipeline = Pipeline([('feature_extractor', TfidfVectorizer()),
('model', LogisticRegression())])
pipeline.fit(train_data['text'], train_data['noisy_label'])
y_pred = pipeline.predict(val_data['text'])
y_val = val_data['label']
f1_score(y_val, y_pred, average='weighted')
0.8017542019951156
Let's also display the confusion matrix to get a better sense of the performance.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
labels = pipeline.classes_
cm = confusion_matrix(y_val, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=labels)
disp.plot()
Looks like mismatches between class labels 2 and 3 are relatively higher than for other label pairs.
With DQC
For the purpose of our experiments, we are going to implement a common approach to assessing the quality of labels in a classification setting - cross-validation based selection. Concretely, we follow a three-step process -
- Split the training data into multiple subsets
- Predict the label for each sample in a given subset using a model trained on the remaining subsets
- Identify samples whose predicted label matches the given (noisy) label with a certain level of confidence
Ideally, for step 1, we'd like to leverage leave-one-out cross-validation (where each subset has only one sample). Keeping feasibility in mind, we run a Stratified K-fold split instead. For demo purposes, we set the number of folds to 5.
from sklearn.model_selection import StratifiedKFold
from tqdm import tqdm
cv = StratifiedKFold(n_splits=5)
label_correctness_score_list = []
predicted_labels_list = []
prediction_probability_list = []
for train_index, val_index in tqdm(cv.split(train_data['text'],
train_data['noisy_label'])):
X_train, X_val = (
train_data.loc[train_index, 'text'].values,
train_data.loc[val_index, 'text'].values,
)
y_train, y_val = (
train_data.loc[train_index, 'noisy_label'].values,
train_data.loc[val_index, 'noisy_label'].values,
)
# Train the model
curate_pipeline = Pipeline([('feature_extractor', TfidfVectorizer()),
('model', LogisticRegression())])
curate_pipeline.fit(X_train, y_train)
# Assess y_val correctness
y_pred_probs = curate_pipeline.predict_proba(X_val)
label_list = curate_pipeline.classes_.tolist()
y_val_scores = [
y_pred_probs[row_index, label_list.index(label)]
for row_index, label in enumerate(y_val)
]
# Get suggested correct label and confidence score
y_pred_max_probs = np.max(y_pred_probs, axis=1).tolist()
y_pred = [label_list[index]
for index in np.argmax(y_pred_probs, axis=1)]
label_correctness_score_list.extend(y_val_scores)
predicted_labels_list.extend(y_pred)
prediction_probability_list.extend(y_pred_max_probs)
Note that we also compute a label_correctness_score. Essentially, for each sample, we store the model's confidence score for the given noisy label, irrespective of the model's top prediction for that sample. This helps quantify the reliability of the given labels.
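To make the distinction concrete, here is a toy illustration with hypothetical numbers (not taken from the run above): suppose a sample carries noisy label 2 and the fold model produces the class probabilities below.
# Hypothetical probabilities over classes [0, 1, 2, 3] for a single sample
probs = np.array([0.10, 0.15, 0.30, 0.45])
noisy_label = 2
prediction_probability = probs.max()          # 0.45 -> confidence in the top prediction
predicted_label = int(probs.argmax())         # 3    -> the model's suggested label
label_correctness_score = probs[noisy_label]  # 0.30 -> confidence in the *given* noisy label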
Now, we add the predictions to our dataframe.
train_data['predicted_label'] = pd.Series(predicted_labels_list)
train_data['prediction_probability'] = pd.Series(prediction_probability_list)
train_data['label_correctness_score'] = pd.Series(label_correctness_score_list)
Great! We've accomplished step 1 and step 2. Let's move on to step 3.
threshold = 0.5
train_data["is_label_correct"] = train_data.apply(lambda x: True if (x["predicted_label"] == x['noisy_label']) and (
x["label_correctness_score"] > threshold) else False, axis=1)
train_data_filtered = train_data.loc[train_data['is_label_correct']].reset_index(drop=True)
print(f'Number of samples : {len(train_data_filtered)}')
train_data_filtered.head()
Number of samples : 18488
text | label | noisy_label | predicted_label | prediction_probability | label_correctness_score | is_label_correct | |
---|---|---|---|---|---|---|---|
0 | oil economy cloud stocks outlook reuters reute... | 2 | 2 | 2 | 0.655066 | 0.655066 | True |
1 | oil economy cloud stocks outlook new york reut... | 2 | 2 | 2 | 0.678507 | 0.678507 | True |
2 | google ipo faces playboy slip bidding gets und... | 2 | 2 | 2 | 0.519994 | 0.519994 | True |
3 | open source apps developer sugarcrm releases s... | 3 | 3 | 3 | 0.551288 | 0.551288 | True |
4 | comets asteroids planets around nearby star sp... | 3 | 3 | 3 | 0.708582 | 0.708582 | True |
We identify samples that we believe are labelled correctly using a confidence score threshold of 0.5, which leaves us with 18,488 samples. In the ideal scenario, we would recover 60,000 samples (since 50% of the 120,000 samples kept their original labels). The gap is a tradeoff driven by the confidence score threshold; the short sweep below makes this explicit.
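The following is a minimal sketch that reuses the columns we just computed to show how the number of retained samples changes with the threshold (the candidate threshold values are arbitrary).
# Number of samples surviving the filter at different confidence thresholds
for t in [0.3, 0.4, 0.5, 0.6, 0.7]:
    retained = ((train_data['predicted_label'] == train_data['noisy_label'])
                & (train_data['label_correctness_score'] > t)).sum()
    print(f'threshold={t:.1f} -> {retained} samples retained')
A higher threshold keeps fewer but cleaner samples, while a lower one keeps more data at the risk of letting noise through. Keeping the threshold at 0.5, let's now train a classification pipeline as we did for our 'Without DQC' baseline and evaluate the performance.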
pipeline = Pipeline([('feature_extractor', TfidfVectorizer()),
('model', LogisticRegression())])
pipeline.fit(train_data_filtered['text'], train_data_filtered['noisy_label'])
y_pred = pipeline.predict(val_data['text'])
y_val = val_data['label']
f1_score(y_val, y_pred, average='weighted')
0.8792279567377235
Definitely an improvement! As before, let's also look at the confusion matrix.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
labels = pipeline.classes_
cm = confusion_matrix(y_val, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=labels)
disp.plot()
There's an overall improvement in performance. Mismatches between class labels 2 and 3 are still relatively higher than for other label pairs.
Reproducibility
Although the results look good, we also need to verify whether they are reproducible under different seed settings. Additionally, there are a few more parameters we could account for -
- The train-to-validation ratio in our current setting is too high (120000:7600, roughly 16:1). Let's make it 4:1, which is more commonly used in practice.
- It would be interesting to do a performance comparison between 'With DQC' and 'Without DQC' for different amounts of noise in the labels.
Let's run the experiments for three random seed settings for different noise levels ranging from 0 (no noisy labels) to 0.5 (50% noisy labels). We start by generating the random seeds
from typing import List
import random
def generate_random_seeds(num_seeds:int, seed:int) -> List[int]:
"""Generate reproducible random seeds.
Args:
num_seeds (int): Number of random seeds to generate
seed (int): Seed value for reproducibility
Returns:
List[int]: List of random seed values
"""
rng = random.Random(seed)
random_seeds = [rng.randint(1, 1000) for _ in range(num_seeds)]
return random_seeds
num_seeds = 3
random_seeds = generate_random_seeds(num_seeds=num_seeds, seed=42)
random_seeds
[655, 115, 26]
If you'd like to experiment with a different number of seed settings, pass a modified num_seeds to generate_random_seeds accordingly. Now let's define the noise levels and a combined dataframe data that will be split into train_data and val_data for each seed setting.
noise_levels = [0, 0.1, 0.2, 0.3, 0.4, 0.5]
column_list = ['text', 'label']
data = pd.DataFrame(columns=column_list)
split_list = ['train', 'validation', 'test']
for split in split_list:
if split in dset.keys():
df = pd.DataFrame(dset[split])[['text', 'label']]
data = pd.concat([data, df],
ignore_index=True)
data = data.reset_index(drop=True)
data['label'] = data['label'].astype(int)
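A quick sanity check on the merged dataframe before we start splitting (optional; if you want the same preprocessing as in the earlier sections, this is also the place to apply it):
# 120,000 train + 7,600 test rows should add up to 127,600 samples in total
print(f'Total samples: {len(data)}')
print(data['label'].value_counts(normalize=True))
# Optionally reuse the preprocessing from earlier:
# data['text'] = data['text'].apply(text_preprocessor)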
Without DQC
First, we run the 'Without DQC' experiment and store the results
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
exp_name = 'Without DQC'
log_df = pd.DataFrame()
for seed_index, seed in enumerate(random_seeds, 1):
print(f'\nBuilding artifacts for Seed {seed_index} of {num_seeds}..\n')
train_data, val_data = train_test_split(data, train_size=0.8, stratify=data['label'],
random_state=seed)
train_data = train_data.reset_index(drop=True)
val_data = val_data.reset_index(drop=True)
y_val = val_data['label'].values
print(f'\nRunning experiment..\n')
for index, noise_level in tqdm(enumerate(noise_levels)):
train_data['noisy_label'], _ = add_asymmetric_noise(train_data['label'],
noise_prob=noise_level,
random_state=seed)
pipeline = Pipeline([('feature_extractor', TfidfVectorizer()),
('model', LogisticRegression())])
pipeline.fit(train_data['text'], train_data['noisy_label'])
y_pred = pipeline.predict(val_data['text'])
score = f1_score(y_val, y_pred, average='weighted')
log_df = pd.concat([log_df, pd.DataFrame({
'Approach' : exp_name,
'Noise (%)': noise_level * 100,
'F1 score' : score,
'Seed Value': seed
}, index=[index])], ignore_index=True)
print(f'\nSeed {seed_index} of {num_seeds} done.\n')
With DQC
Now, we do the same with our quality checker. But first, let's wrap the process into a function for repeated use
def quality_checker(train_data: pd.DataFrame) -> pd.DataFrame:
"""
Runs crossvalidation based selection on input data and returns it with quality check information
Args:
train_data (pd.DataFrame): Input data
Returns:
pd.DataFrame: Data with quality checks
"""
cv = StratifiedKFold(n_splits=5)
label_correctness_score_list = []
predicted_labels_list = []
prediction_probability_list = []
for train_index, val_index in tqdm(cv.split(train_data['text'],
train_data['noisy_label'])):
X_train, X_val = (
train_data.loc[train_index, 'text'].values,
train_data.loc[val_index, 'text'].values,
)
y_train, y_val = (
train_data.loc[train_index, 'noisy_label'].values,
train_data.loc[val_index, 'noisy_label'].values,
)
# Train the model
curate_pipeline = Pipeline([('feature_extractor', TfidfVectorizer()),
('model', LogisticRegression())])
curate_pipeline.fit(X_train, y_train)
# Assess y_val correctness
y_pred_probs = curate_pipeline.predict_proba(X_val)
label_list = curate_pipeline.classes_.tolist()
y_val_scores = [
y_pred_probs[row_index, label_list.index(label)]
for row_index, label in enumerate(y_val)
]
# Get suggested correct label and confidence score
y_pred_max_probs = np.max(y_pred_probs, axis=1).tolist()
y_pred = [label_list[index]
for index in np.argmax(y_pred_probs, axis=1)]
label_correctness_score_list.extend(y_val_scores)
predicted_labels_list.extend(y_pred)
prediction_probability_list.extend(y_pred_max_probs)
train_data['predicted_label'] = pd.Series(predicted_labels_list)
train_data['prediction_probability'] = pd.Series(prediction_probability_list)
train_data['label_correctness_score'] = pd.Series(label_correctness_score_list)
train_data["is_label_correct"] = train_data.apply(lambda x: True if (x["predicted_label"] == x['noisy_label']) and (
x["label_correctness_score"] > threshold) else False, axis=1)
return train_data
We go ahead and run the experiment
exp_name = 'With DQC'
label_correctness_score_list = []
for seed_index, seed in enumerate(random_seeds, 1):
print(f'\nBuilding artifacts for Seed {seed_index} of {num_seeds}..\n')
train_data, val_data = train_test_split(data, train_size=0.8, stratify=data['label'],
random_state=seed)
train_data = train_data.reset_index(drop=True)
val_data = val_data.reset_index(drop=True)
y_val = val_data['label'].values
print('Running experiment..\n')
for index, noise_level in tqdm(enumerate(noise_levels)):
train_data['noisy_label'], _ = add_asymmetric_noise(train_data['label'],
noise_prob=noise_level,
random_state=seed)
train_data_modified = quality_checker(train_data)
filtered_indices = train_data_modified.loc[train_data_modified['is_label_correct']].index
X_train = train_data_modified.loc[filtered_indices, 'text'].values
y_train = train_data.loc[filtered_indices, 'noisy_label'].astype(int)
pipeline = Pipeline([('feature_extractor', TfidfVectorizer()),
('model', LogisticRegression())])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(val_data['text'])
score = f1_score(y_val, y_pred, average='weighted')
log_df = pd.concat([log_df, pd.DataFrame({
'Approach' : exp_name,
'Noise (%)': noise_level * 100,
'F1 score' : score,
'Seed Value': seed
}, index=[index])], ignore_index=True)
print(f'\nSeed {seed_index} of {num_seeds} done.\n')
We can preview the results stored in our dataframe log_df
print(f'Number of Entries: {len(log_df)}')
log_df.head()
Number of Entries: 36
|   | Approach | Noise (%) | F1 score | Seed Value |
|---|----------|-----------|----------|------------|
| 0 | Without DQC | 0.0 | 0.917693 | 655 |
| 1 | Without DQC | 10.0 | 0.913206 | 655 |
| 2 | Without DQC | 20.0 | 0.902473 | 655 |
| 3 | Without DQC | 30.0 | 0.888963 | 655 |
| 4 | Without DQC | 40.0 | 0.849769 | 655 |
Let's compute the average across seed settings
res_df = (log_df.groupby(['Approach', 'Noise (%)']))['F1 score'].agg(F1_score_mean='mean', F1_score_std='std').reset_index()
res_df
|    | Approach | Noise (%) | F1_score_mean | F1_score_std |
|----|----------|-----------|---------------|--------------|
| 0  | With DQC | 0.0 | 0.893723 | 0.002926 |
| 1  | With DQC | 10.0 | 0.890655 | 0.002309 |
| 2  | With DQC | 20.0 | 0.887036 | 0.002380 |
| 3  | With DQC | 30.0 | 0.880883 | 0.002431 |
| 4  | With DQC | 40.0 | 0.869731 | 0.005171 |
| 5  | With DQC | 50.0 | 0.855769 | 0.004319 |
| 6  | Without DQC | 0.0 | 0.917357 | 0.002232 |
| 7  | Without DQC | 10.0 | 0.913147 | 0.001839 |
| 8  | Without DQC | 20.0 | 0.903196 | 0.001554 |
| 9  | Without DQC | 30.0 | 0.886368 | 0.003220 |
| 10 | Without DQC | 40.0 | 0.853253 | 0.006075 |
| 11 | Without DQC | 50.0 | 0.779460 | 0.009221 |
DQC seems to offer better results once the noise ratio exceeds 30%, while the plain baseline stays marginally ahead at lower noise levels. Let's visualize this to make it more interpretable.
import plotly.express as px
fig = px.line(res_df, x="Noise (%)", y="F1_score_mean", color="Approach",
              color_discrete_map={'Without DQC': 'orange',
                                  'With DQC': 'blue'},
              labels={"F1_score_mean": "F1 score (Seed Averaged)"})
fig.show()
There you have it - your own DQC for text classification with noisy labels. Ideally, we'd also like to verify the statistical significance of the performance differences we are observing; a hedged sketch of one way to do that follows below. There are also quite a few ways we can improve what we've built, which we can cover in a future post if people are interested.
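As a starting point, the per-seed scores already stored in log_df can be compared with a paired test at a fixed noise level. The snippet below is a minimal sketch assuming scipy is installed; with only three seeds the test is heavily underpowered, so treat it as illustrative rather than conclusive.
from scipy.stats import ttest_rel

noise = 50.0
mask_dqc = (log_df['Approach'] == 'With DQC') & (log_df['Noise (%)'] == noise)
mask_base = (log_df['Approach'] == 'Without DQC') & (log_df['Noise (%)'] == noise)
scores_dqc = log_df[mask_dqc].sort_values('Seed Value')['F1 score'].values
scores_base = log_df[mask_base].sort_values('Seed Value')['F1 score'].values

# Paired t-test across seeds; more seed settings would be needed for a reliable verdict
stat, p_value = ttest_rel(scores_dqc, scores_base)
print(f'p-value at {noise:.0f}% noise: {p_value:.4f}')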
DQC Toolkit
Instead of building your own DQC, you could also simply use DQC-Toolkit, the open source library mentioned at the beginning of this post, to run quality checks on data. To understand how to use it, let's do a quick demo by extending the reproducibility experiment to DQC Toolkit.
We install it using pip as shown below
pip install dqc-toolkit
Following is a short example of how to use it. The output is similar to that of the quality_checker function we previously defined.
from dqc import CrossValCurate
cvc = CrossValCurate()
train_data_modified = cvc.fit_transform(train_data, y_col_name='noisy_label')
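The returned dataframe carries an is_label_correct flag (the same one we used with our own quality_checker), so the curated subset can be selected the same way; this is a small sketch of what the loop below does.
# Keep only the samples the quality check marks as reliably labelled
train_data_curated = train_data_modified.loc[train_data_modified['is_label_correct']].reset_index(drop=True)
print(f'Number of curated samples: {len(train_data_curated)}')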
Let's run it as part of the reproducibility experiment -
exp_name = 'With DQC Toolkit'
for seed_index, seed in enumerate(random_seeds, 1):
print(f'\nBuilding artifacts for Seed {seed_index} of {num_seeds}..\n')
train_data, val_data = train_test_split(data, train_size=0.8, stratify=data['label'],
random_state=seed)
train_data = train_data.reset_index(drop=True)
val_data = val_data.reset_index(drop=True)
y_val = val_data['label'].values
print('Running experiment..\n')
for index, noise_level in enumerate(noise_levels):
train_data['noisy_label'], _ = add_asymmetric_noise(train_data['label'],
noise_prob=noise_level,
random_state=seed)
cvc = CrossValCurate()
train_data_modified = cvc.fit_transform(train_data, y_col_name='noisy_label')
filtered_indices = train_data_modified.loc[train_data_modified['is_label_correct']].index
X_train = train_data_modified.loc[filtered_indices, 'text'].values
y_train = train_data.loc[filtered_indices, 'noisy_label'].astype(int)
pipeline = Pipeline([('feature_extractor', TfidfVectorizer()),
('model', LogisticRegression())])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(val_data['text'])
score = f1_score(y_val, y_pred, average='weighted')
log_df = pd.concat([log_df, pd.DataFrame({
'Approach' : exp_name,
'Noise (%)': noise_level * 100,
'F1 score' : score,
'Seed Value': seed
}, index=[index])], ignore_index=True)
print(f'\nSeed {seed_index} of {num_seeds} done.\n')
As before, we average performances across seeds and visualize the results
res_df = (log_df.groupby(['Approach', 'Noise (%)']))['F1 score'].agg(F1_score_mean='mean', F1_score_std='std').reset_index()
fig = px.line(res_df, x="Noise (%)", y="F1_score_mean", color="Approach",
color_discrete_map={'Without DQC': 'orange',
'With DQC': 'blue',
'With DQC Toolkit': 'green'
},
labels={"F1_score_mean": "F1 score (Seed Averaged)"},
category_orders={'Approach': ['Without DQC', 'With DQC', 'With DQC Toolkit']})
fig.show()
DQC Toolkit outperforms 'Without DQC' at noise levels above 10%, which could be attributed to additional optimizations implemented in the library. It currently supports text classification (binary / multi-class) problems with various parameter customization options; check out the documentation for details. The plan is to enhance it further by adding more capabilities based on your feedback. Following is the link to the repo -
sumanthprabhu / DQC-Toolkit
Data quality checks to curate noisy labels in the data
DQC Toolkit is a Python library and framework designed to facilitate improvement of Machine Learning models by identifying and mitigating label errors in the training dataset. Currently, DQC Toolkit offers CrossValCurate for curation of text classification datasets (binary / multi-class) using cross-validation based selection.
Installation
Installation of DQC-toolkit can be done as shown below
pip install dqc-toolkit
Quick Start
Assuming your text classification data is stored as a pandas dataframe data, with each sample represented by the text column and its corresponding noisy label represented by the label column, here is how you use CrossValCurate -
from dqc import CrossValCurate
cvc = CrossValCurate()
data_curated = cvc.fit_transform(data[['text', 'label']])
The result stored in data_curated is a pandas dataframe similar to data with the following columns -
>>> data_curated.columns
['text', 'label',
…
Thank you for reading
Passionate about Machine Learning? Please feel free to add me on LinkedIn.