Handling Noisy Labels in Text Classification



One of the most common bottlenecks in building supervised Machine Learning systems for real-world problems is the availability of high-quality labelled data. With recent advancements such as Large Language Models, people have started exploring pre-trained models and embeddings in a plug-and-play fashion to address the data gap. However, in real-world scenarios, we often encounter problem statements that depend on domain-specific data with noisy labels or no labels at all. Manual annotation may be an option, but the process can be quite expensive. Moreover, manually labelled data is itself susceptible to noise and errors due to factors such as annotation ambiguity, human subjectivity and fatigue, and therefore needs further curation. Such situations call for fine-tuning cost-effective models using the available resources.

For over a decade now, I have been focusing primarily on Natural Language Understanding problems, and on numerous occasions I have ended up exploring multiple strategies, ranging from designing robust loss functions to weaker forms of supervision (weak supervision, semi-supervision, etc.), to handle data with noisy labels. If I were to revisit some of those problems today, I would prefer having access to a common framework that runs quality checks on input data, thereby decoupling downstream models that consume high-quality data across use cases. Let's call this framework "Data Quality Checks", or DQC for short. While there have been initiatives around DQC, such efforts either did not evolve into an actively maintained general framework or come with stringent license requirements when it comes to commercial use.

In this article, I want to walk through how you could build your own DQC for text classification data with noisy labels. Using a publicly available benchmark dataset, we will learn how to identify samples with reliable labels from a pool of samples with potentially noisy labels. Towards the end of the article, I'll also share details about DQC Toolkit, an open source library I'm building along similar lines to curate such noisy data.

Let’s get started.

Dataset

For the purpose of demonstration, we are going to use the popular text classification benchmark dataset AGNews. We will load it with Hugging Face's datasets package.

from datasets import load_dataset
import pandas as pd 

dataset = 'ag_news'
dset = load_dataset(dataset)
dset
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

Let's create the train_data and val_data DataFrames

train_data = pd.DataFrame(dset['train'])
val_data = pd.DataFrame(dset['test'])

print(f'Number of training samples: {len(train_data)}')
print(f'Number of validation samples: {len(val_data)}')

train_data.head()
Number of training samples: 120000
Number of validation samples: 7600
text label
0 Wall St. Bears Claw Back Into the Black (Reute... 2
1 Carlyle Looks Toward Commercial Aerospace (Reu... 2
2 Oil and Economy Cloud Stocks' Outlook (Reuters... 2
3 Iraq Halts Oil Exports from Main Southern Pipe... 2
4 Oil prices soar to all-time record, posing new... 2

Let's look at the label distribution of train_data and val_data to ensure that the label distributions are similar

train_data['label'].value_counts(1)
label
2    0.25
3    0.25
1    0.25
0    0.25
Name: proportion, dtype: float64
val_data['label'].value_counts(1)
label
2    0.25
3    0.25
1    0.25
0    0.25
Name: proportion, dtype: float64

The label distributions match. We can stick to the default train-validation splits for now.

Simulate noisy labels

Since the goal of the article is to study the impact of noisy labels in the data, let's mislabel some of the samples in our training data in a controlled setting to be able to benchmark our DQC performance.

import numpy as np
import pandas as pd
from typing import Tuple, Union


def add_asymmetric_noise(
    labels: pd.Series,
    noise_prob: float,
    random_state: Union[int, None] = 42,
) -> Tuple[pd.Series, float]:
    """
    Util function to add asymmetric noise to labels
    for simulation of noisy label scenarios.

    Args:
        labels (pd.Series): Input pandas series with integer values
                        ranging from 0 to n - 1.
        noise_prob (float): Probability of adding noise to each value.
        random_state (Union[int, None]): Random seed for reproducibility
    Returns:
        pd.Series: Series with asymmetric noise added to it.
        float: Normalized quantification of pairwise disagreement between `labels` and `noisy_labels` for parity check
    """
    # Set seed
    np.random.seed(random_state)

    # Avoid modifying the original data
    noisy_labels = labels.copy()

    # Build a replacement dictionary
    unique_labels = list(set(noisy_labels))
    replacement_dict = {label: [candidate for candidate in unique_labels
                                if candidate != label]
                        for label in unique_labels}

    # Determine the number of samples to modify based on the noise probability
    num_samples = min(len(noisy_labels), int(len(noisy_labels) * noise_prob + 1))

    # Sample random indices from the labels to introduce noise
    target_indices = np.random.choice(len(noisy_labels), num_samples, replace=False)

    for idx in target_indices:
        # Introduce noise
        noisy_labels[idx] = np.random.choice(replacement_dict[noisy_labels[idx]])

    # Parity check    
    num_mismatches = sum([label != noisy_label for label, noisy_label 
                          in zip(labels.values, noisy_labels.values)])
    observed_noise_ratio = num_mismatches / len(noisy_labels)

    return noisy_labels, observed_noise_ratio

We've created a function add_asymmetric_noise that takes our input labels and returns them after adding noise with probability noise_prob. The variable observed_noise_ratio indicates whether the final noise ratio in our labels matches noise_prob. Let's test it out

noisy_labels, observed_noise_ratio = add_asymmetric_noise(train_data['label'], 
                                                 noise_prob=0.5)
observed_noise_ratio
0.5000083333333334

We'd like 50% of the data labels to be noisy, so noise_prob is set to 0.5. observed_noise_ratio is also close to 0.5, so we are good to go. Now, let's create a new column 'noisy_label' in train_data

train_data['noisy_label'] = noisy_labels
train_data.head()
text label noisy_label
0 Wall St. Bears Claw Back Into the Black (Reute... 2 0
1 Carlyle Looks Toward Commercial Aerospace (Reu... 2 3
2 Oil and Economy Cloud Stocks' Outlook (Reuters... 2 2
3 Iraq Halts Oil Exports from Main Southern Pipe... 2 0
4 Oil prices soar to all-time record, posing new... 2 0

Out of curiosity, let's check what the label distribution looks like

train_data['noisy_label'].value_counts(1)
noisy_label
0    0.251325
2    0.250467
1    0.249333
3    0.248875
Name: proportion, dtype: float64

Yikes! The distribution is extremely similar to the original label distribution even though 50% of the labels are noisy, so the data appears error-free at face value.
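As a quick sanity check using the columns we just created, we can confirm that, despite the near-identical class proportions, roughly half of the individual labels no longer agree with the originals:

# Fraction of samples whose label was flipped (close to 0.5, as observed earlier)
(train_data['label'] != train_data['noisy_label']).mean()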

Pre-processing

Let's define a function to preprocess the text before it is consumed by our Machine Learning pipeline. For our current dataset, pre-processing includes converting text to lowercase, retaining only alphanumeric characters and removing stopwords.

import re
from nltk import word_tokenize
from nltk.corpus import stopwords

def text_preprocessor(text):
    # Function to perform text cleaning and pre-processing
    # Convert text to lowercase
    text = text.lower()

    # Replace anything that is not a lowercase letter or digit with a space
    text = re.sub(r"[^a-z0-9]", " ", text)

    # Tokenize text
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Join tokens back into text
    preprocessed_text = ' '.join(filtered_tokens)

    return preprocessed_text

train_data['text'] = train_data['text'].apply(text_preprocessor)
val_data['text'] = val_data['text'].apply(text_preprocessor)
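Note that word_tokenize and the stopword list rely on NLTK resources that need to be downloaded once. In case they are not already available in your environment, a minimal sketch:

import nltk

# One-time downloads required by word_tokenize and the English stopword list
nltk.download('punkt')
nltk.download('stopwords')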

Supervised Learning

Now that we have the data in place, let's build a baseline model. In this article, we will use scikit-learn's TfidfVectorizer combined with LogisticRegression as our classification pipeline.

Note: Although pre-trained embeddings can outperform TF-IDF while needing little text preprocessing, we stick to TF-IDF for the demo because we'd like to avoid potential leakage (the embeddings could have been exposed to publicly available datasets such as AGNews during training), which could lead us to draw incorrect conclusions about the impact of noisy labels.

Baselines

Without DQC

Our baseline is a model trained on all of the available noisy data without any quality checks. For evaluation, we use the weighted F1 score.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline


pipeline = Pipeline([('feature_extractor', TfidfVectorizer()),
                  ('model', LogisticRegression())])

pipeline.fit(train_data['text'], train_data['noisy_label'])

y_pred = pipeline.predict(val_data['text'])
y_val = val_data['label']

f1_score(y_val, y_pred, average='weighted')
0.8017542019951156

Let's also display the confusion matrix to get a better sense of the performance.

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

labels = pipeline.classes_
cm = confusion_matrix(y_val, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                               display_labels=labels)
disp.plot()

Without DQC - Confusion Matrix

It looks like mismatches between class labels 2 and 3 (Business and Sci/Tech in AGNews) are relatively more frequent than for other pairs.
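To double-check the mapping from integer labels to class names, we can read it off the Hugging Face dataset object loaded earlier (the label column is a ClassLabel feature):

# Class names corresponding to the integer labels
label_names = dset['train'].features['label'].names
print(label_names)   # ['World', 'Sports', 'Business', 'Sci/Tech']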

With DQC

For the purpose of our experiments, we are going to implement a common approach to assessing label quality in a classification setting: cross-validation based selection. Concretely, we follow a three-step process -

  1. Split the training data into multiple subsets.
  2. Predict the label for each sample in a given subset using a model trained on the remaining subsets.
  3. Identify samples whose predicted label matches the given (noisy) label with a certain level of confidence.

Ideally, for step 1, we'd like to use leave-one-out cross-validation (where each subset contains exactly one sample). Keeping computational feasibility in mind, we run a stratified K-fold split instead. For demo purposes, we set the number of folds to 5

from sklearn.model_selection import StratifiedKFold
from tqdm import tqdm

cv = StratifiedKFold(n_splits=5)

label_correctness_score_list = []
predicted_labels_list = []
prediction_probability_list = []

for train_index, val_index in tqdm(cv.split(train_data['text'], 
                                            train_data['noisy_label'])):
    X_train, X_val = (
                train_data.loc[train_index, 'text'].values,
                train_data.loc[val_index, 'text'].values,
            )
    y_train, y_val = (
        train_data.loc[train_index, 'noisy_label'].values,
        train_data.loc[val_index, 'noisy_label'].values,
    )

    # Train the model
    curate_pipeline = Pipeline([('feature_extractor', TfidfVectorizer()),
                  ('model', LogisticRegression())])

    curate_pipeline.fit(X_train, y_train)

    # Assess y_val correctness
    y_pred_probs = curate_pipeline.predict_proba(X_val)
    label_list = curate_pipeline.classes_.tolist()

    y_val_scores = [
            y_pred_probs[row_index, label_list.index(label)]
            for row_index, label in enumerate(y_val)
        ]
    # Get suggested correct label and confidence score
    y_pred_max_probs = np.max(y_pred_probs, axis=1).tolist()
    y_pred = [label_list[index] 
             for index in np.argmax(y_pred_probs, axis=1)]

    label_correctness_score_list.extend(y_val_scores)
    predicted_labels_list.extend(y_pred)
    prediction_probability_list.extend(y_pred_max_probs)

Note that we also track a quantity label_correctness_score. For each sample, we store the model's predicted probability for the given (noisy) label, irrespective of the model's top prediction for that sample. This helps quantify the reliability of the given labels, as the toy example below illustrates.
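As a toy illustration with made-up numbers, suppose the model's predicted probabilities for one sample over classes [0, 1, 2, 3] are as follows; the three quantities we track would then be:

# Hypothetical probabilities for a single sample
probs = [0.1, 0.2, 0.6, 0.1]
noisy_label = 3

label_correctness_score = probs[noisy_label]        # 0.1 -> model's trust in the given label
predicted_label = probs.index(max(probs))           # 2   -> model's own suggestion
prediction_probability = max(probs)                 # 0.6 -> confidence in that suggestion

A low label_correctness_score combined with a confident, disagreeing predicted_label is a strong hint that the given label is wrong.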

Now, we add the predictions to our dataframe.

train_data['predicted_label'] = pd.Series(predicted_labels_list)
train_data['prediction_probability'] = pd.Series(prediction_probability_list)
train_data['label_correctness_score'] = pd.Series(label_correctness_score_list)

Great! We've accomplished step 1 and step 2. Let's move on to step 3.

threshold = 0.5
train_data["is_label_correct"] = (train_data["predicted_label"] == train_data["noisy_label"]) & (
            train_data["label_correctness_score"] > threshold)
train_data_filtered = train_data.loc[train_data['is_label_correct']].reset_index(drop=True)
print(f'Number of samples : {len(train_data_filtered)}')
train_data_filtered.head()
Number of samples : 18488
text label noisy_label predicted_label prediction_probability label_correctness_score is_label_correct
0 oil economy cloud stocks outlook reuters reute... 2 2 2 0.655066 0.655066 True
1 oil economy cloud stocks outlook new york reut... 2 2 2 0.678507 0.678507 True
2 google ipo faces playboy slip bidding gets und... 2 2 2 0.519994 0.519994 True
3 open source apps developer sugarcrm releases s... 3 3 3 0.551288 0.551288 True
4 comets asteroids planets around nearby star sp... 3 3 3 0.708582 0.708582 True

We identify samples that we believe are labelled correctly using a confidence score threshold of 0.5, which leaves us with 18,488 samples. In the ideal scenario, we would recover 60,000 samples (since 50% of the 120,000 samples were left untouched); the gap is a tradeoff governed by the chosen threshold, which we can inspect as sketched below.
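Since the noise was simulated, the clean label column is still available, so we can check, for a few candidate thresholds, how many samples are retained and what fraction of them are actually clean. A minimal sketch using the columns created above (output not shown):

# Inspect the retention / purity tradeoff for a few thresholds
for t in [0.3, 0.4, 0.5, 0.6]:
    mask = (train_data['predicted_label'] == train_data['noisy_label']) & \
           (train_data['label_correctness_score'] > t)
    retained = mask.sum()
    clean_fraction = (train_data.loc[mask, 'noisy_label'] == train_data.loc[mask, 'label']).mean()
    print(f'threshold={t}: retained={retained}, fraction actually clean={clean_fraction:.3f}')

With the filtered data in hand, let's train a classification pipeline as we did for the 'Without DQC' baseline and evaluate the performance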

pipeline = Pipeline([('feature_extractor', TfidfVectorizer()),
                  ('model', LogisticRegression())])

pipeline.fit(train_data_filtered['text'], train_data_filtered['noisy_label'])

y_pred = pipeline.predict(val_data['text'])
y_val = val_data['label']

f1_score(y_val, y_pred, average='weighted')
0.8792279567377235

Definitely an improvement! As before, let's also look at the confusion matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

labels = pipeline.classes_
cm = confusion_matrix(y_val, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                               display_labels=labels)
disp.plot()

Confusion Matrix - With DQC

There's an overall improvement in performance, although mismatches between class labels 2 and 3 are still relatively more frequent than for other pairs.

Reproducibility

Although the results look good, we also need to verify whether they are reproducible under different seed settings. Additionally, there are a few more parameters we could account for -

  1. The train-to-validation ratio in our current setting is quite high (120000:7600 ≈ 15:1). Let's make it 4:1, which is more commonly used in practice.
  2. It would be interesting to do a performance comparison between 'With DQC' and 'Without DQC' for different amounts of noise in the labels.

Let's run the experiments for three random seed settings for different noise levels ranging from 0 (no noisy labels) to 0.5 (50% noisy labels). We start by generating the random seeds

from typing import List
import random

def generate_random_seeds(num_seeds:int, seed:int) -> List[int]:
    """Generate reproducible random seeds.

    Args:
        num_seeds (int): Number of random seeds to generate
        seed (int): Seed value for reproducibility

    Returns:
        List[int]: List of random seed values
    """
    rng = random.Random(seed)

    random_seeds = [rng.randint(1, 1000) for _ in range(num_seeds)]
    return random_seeds

num_seeds = 3
random_seeds = generate_random_seeds(num_seeds=num_seeds, seed=42)
random_seeds
[655, 115, 26]

If you'd like to experiment with a different number of seed settings, pass a modified num_seeds to generate_random_seeds. Now let's define the noise levels and a combined dataframe data that will be split into train_data and val_data for each seed setting.

noise_levels = [0, 0.1, 0.2, 0.3, 0.4, 0.5]

column_list = ['text', 'label']
data = pd.DataFrame(columns=column_list)

split_list = ['train', 'validation', 'test']
for split in split_list:
    if split in dset.keys():
        df = pd.DataFrame(dset[split])[['text', 'label']]
        data = pd.concat([data, df], 
                                ignore_index=True)

data = data.reset_index(drop=True)
data['label'] = data['label'].astype(int) 

Without DQC

First, we run the 'Without DQC' experiment and store the results

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

exp_name = 'Without Data Quality Check'
log_df = pd.DataFrame()
for seed_index, seed in enumerate(random_seeds, 1):
    print(f'\nBuilding artifacts for Seed {seed_index} of {num_seeds}..\n')

    train_data, val_data = train_test_split(data, train_size=0.8, stratify=data['label'],
                                            random_state=seed)
    train_data = train_data.reset_index(drop=True)
    val_data = val_data.reset_index(drop=True)

    y_val = val_data['label'].values

    print(f'\nRunning experiment..\n')

    for index, noise_level in tqdm(enumerate(noise_levels)):

        train_data['noisy_label'], _ = add_asymmetric_noise(train_data['label'], 
                                                  noise_prob=noise_level,
                                                  random_state=seed)

        pipeline = Pipeline([('feature_extractor', TfidfVectorizer()),
                  ('model', LogisticRegression())])

        pipeline.fit(train_data['text'], train_data['noisy_label'])

        y_pred = pipeline.predict(val_data['text'])

        score = f1_score(y_val, y_pred, average='weighted')
        log_df = pd.concat([log_df, pd.DataFrame({
            'Approach' : exp_name,
            'Noise (%)': noise_level * 100,
            'F1 score' : score,
            'Seed Value': seed

        }, index=[index])], ignore_index=True)
    print(f'\nSeed {seed_index} of {num_seeds} done.\n')

With DQC

Now, we do the same with our quality checker. But first, let's wrap the process into a function for repeated use

def quality_checker(train_data: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """
    Runs cross-validation based selection on the input data and returns it with quality check information

    Args:
        train_data (pd.DataFrame): Input data
        threshold (float): Confidence score threshold above which a matching label is flagged as correct

    Returns:
        pd.DataFrame: Data with quality check columns added
    """
    cv = StratifiedKFold(n_splits=5)

    label_correctness_score_list = []
    predicted_labels_list = []
    prediction_probability_list = []

    for train_index, val_index in tqdm(cv.split(train_data['text'], 
                                                train_data['noisy_label'])):
        X_train, X_val = (
                    train_data.loc[train_index, 'text'].values,
                    train_data.loc[val_index, 'text'].values,
                )
        y_train, y_val = (
            train_data.loc[train_index, 'noisy_label'].values,
            train_data.loc[val_index, 'noisy_label'].values,
        )

        # Train the model
        curate_pipeline = Pipeline([('feature_extractor', TfidfVectorizer()),
                      ('model', LogisticRegression())])

        curate_pipeline.fit(X_train, y_train)

        # Assess y_val correctness
        y_pred_probs = curate_pipeline.predict_proba(X_val)
        label_list = curate_pipeline.classes_.tolist()

        y_val_scores = [
                y_pred_probs[row_index, label_list.index(label)]
                for row_index, label in enumerate(y_val)
            ]
        # Get suggested correct label and confidence score
        y_pred_max_probs = np.max(y_pred_probs, axis=1).tolist()
        y_pred = [label_list[index] 
                 for index in np.argmax(y_pred_probs, axis=1)]

        label_correctness_score_list.extend(y_val_scores)
        predicted_labels_list.extend(y_pred)
        prediction_probability_list.extend(y_pred_max_probs)

    # Aggregate out-of-fold predictions once all folds are processed
    train_data['predicted_label'] = pd.Series(predicted_labels_list)
    train_data['prediction_probability'] = pd.Series(prediction_probability_list)
    train_data['label_correctness_score'] = pd.Series(label_correctness_score_list)

    train_data["is_label_correct"] = (train_data["predicted_label"] == train_data["noisy_label"]) & (
        train_data["label_correctness_score"] > threshold)

    return train_data

We go ahead and run the experiment

exp_name = 'With Data Quality Checks'

label_correctness_score_list = []
for seed_index, seed in enumerate(random_seeds, 1):
    print(f'\nBuilding artifacts for Seed {seed_index} of {num_seeds}..\n')

    train_data, val_data = train_test_split(data, train_size=0.8, stratify=data['label'],
                                            random_state=seed)
    train_data = train_data.reset_index(drop=True)
    val_data = val_data.reset_index(drop=True)

    y_val = val_data['label'].values

    print('Running experiment..\n')
    for index, noise_level in tqdm(enumerate(noise_levels)):

        train_data['noisy_label'], _ = add_asymmetric_noise(train_data['label'], 
                                                  noise_prob=noise_level,
                                                  random_state=seed)

        train_data_modified = quality_checker(train_data)
        filtered_indices = train_data_modified.loc[train_data_modified['is_label_correct']].index

        X_train = train_data_modified.loc[filtered_indices, 'text'].values
        y_train = train_data.loc[filtered_indices, 'noisy_label'].astype(int)

        pipeline = Pipeline([('feature_extractor', TfidfVectorizer()),
                  ('model', LogisticRegression())])

        pipeline.fit(X_train, y_train)

        y_pred = pipeline.predict(val_data['text'])

        score = f1_score(y_val, y_pred, average='weighted')

        log_df = pd.concat([log_df, pd.DataFrame({
            'Approach' : exp_name,
            'Noise (%)': noise_level * 100,
            'F1 score' : score,
            'Seed Value': seed
        }, index=[index])], ignore_index=True)

    print(f'\nSeed {seed_index} of {num_seeds} done.\n')

We can preview the results stored in our dataframe log_df

print(f'Number of Entries: {len(log_df)}')
log_df.head()
Number of Entries: 36
Approach Noise (%) F1 score Seed Value
0 Without Data Quality Check 0.0 0.917693 655
1 Without Data Quality Check 10.0 0.913206 655
2 Without Data Quality Check 20.0 0.902473 655
3 Without Data Quality Check 30.0 0.888963 655
4 Without Data Quality Check 40.0 0.849769 655

Let's compute the mean and standard deviation of the F1 score across seed settings

res_df = (log_df.groupby(['Approach', 'Noise (%)']))['F1 score'].agg(F1_score_mean='mean', F1_score_std='std').reset_index()
res_df
Approach Noise (%) F1_score_mean F1_score_std
0 With Data Quality Checks 0.0 0.893723 0.002926
1 With Data Quality Checks 10.0 0.890655 0.002309
2 With Data Quality Checks 20.0 0.887036 0.002380
3 With Data Quality Checks 30.0 0.880883 0.002431
4 With Data Quality Checks 40.0 0.869731 0.005171
5 With Data Quality Checks 50.0 0.855769 0.004319
6 Without Data Quality Check 0.0 0.917357 0.002232
7 Without Data Quality Check 10.0 0.913147 0.001839
8 Without Data Quality Check 20.0 0.903196 0.001554
9 Without Data Quality Check 30.0 0.886368 0.003220
10 Without Data Quality Check 40.0 0.853253 0.006075
11 Without Data Quality Check 50.0 0.779460 0.009221

DQC seems to offer better results once the noise ratio exceeds 30%. Let's visualize this to make it more interpretable

import plotly.express as px

fig = px.line(res_df, x="Noise (%)", y="F1_score_mean", color="Approach",
             color_discrete_map={'Without Data Quality Check': 'orange',
                                 'With Data Quality Checks': 'blue',
                                },
             labels={"F1_score_mean": "F1 score (Seed Averaged)",})
fig.show()

Performance comparison - 'Without DQC' vs. 'With DQC' (seed-averaged F1 score vs. noise level)

There you have it: your own DQC for text classification with noisy labels. Ideally, we'd also like to verify the statistical significance of the performance differences we are observing (a minimal sketch of such a check follows), and there are quite a few ways to improve what we've built. We can cover these in a future post if people are interested.
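For instance, one could run a paired test on the per-seed F1 scores at a fixed noise level. A minimal sketch using scipy's ttest_rel, assuming the Approach names used in log_df above (with only three seeds this is purely illustrative):

from scipy.stats import ttest_rel

# Paired comparison of per-seed F1 scores at a fixed noise level
# (illustrative only: three seeds are far too few for a reliable p-value)
noise = 50.0
without_dqc = log_df[(log_df['Approach'] == 'Without Data Quality Check') &
                     (log_df['Noise (%)'] == noise)].sort_values('Seed Value')['F1 score'].values
with_dqc = log_df[(log_df['Approach'] == 'With Data Quality Checks') &
                  (log_df['Noise (%)'] == noise)].sort_values('Seed Value')['F1 score'].values

t_stat, p_value = ttest_rel(with_dqc, without_dqc)
print(f't-statistic={t_stat:.3f}, p-value={p_value:.3f}')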

DQC Toolkit

Instead of building your own DQC, you could also simply use DQC-Toolkit, the open source library mentioned at the beginning of this post, to run quality checks on data. To understand how to use it, let's do a quick demo by extending the reproducibility experiment to DQC Toolkit.

We install it using pip as shown below

pip install dqc-toolkit

Following is a short example of how to use it. The output is similar to that of the quality_checker function we defined earlier

from dqc import CrossValCurate

cvc = CrossValCurate()
train_data_modified = cvc.fit_transform(train_data, y_col_name='noisy_label')

Let's run it as part of the reproducibility experiment -

exp_name = 'With DQC Toolkit'

for seed_index, seed in enumerate(random_seeds, 1):
    print(f'\nBuilding artifacts for Seed {seed_index} of {num_seeds}..\n')

    train_data, val_data = train_test_split(data, train_size=0.8, stratify=data['label'],
                                            random_state=seed)
    train_data = train_data.reset_index(drop=True)
    val_data = val_data.reset_index(drop=True)

    y_val = val_data['label'].values

    print('Running experiment..\n')
    for index, noise_level in enumerate(noise_levels):

        train_data['noisy_label'], _ = add_asymmetric_noise(train_data['label'], 
                                                  noise_prob=noise_level,
                                                  random_state=seed)

        cvc = CrossValCurate()
        train_data_modified = cvc.fit_transform(train_data, y_col_name='noisy_label')

        filtered_indices = train_data_modified.loc[train_data_modified['is_label_correct']].index

        X_train = train_data_modified.loc[filtered_indices, 'text'].values
        y_train = train_data.loc[filtered_indices, 'noisy_label'].astype(int)

        pipeline = Pipeline([('feature_extractor', TfidfVectorizer()),
                  ('model', LogisticRegression())])

        pipeline.fit(X_train, y_train)

        y_pred = pipeline.predict(val_data['text'])

        score = f1_score(y_val, y_pred, average='weighted')

        log_df = pd.concat([log_df, pd.DataFrame({
            'Approach' : exp_name,
            'Noise (%)': noise_level * 100,
            'F1 score' : score,
            'Seed Value': seed
        }, index=[index])], ignore_index=True)

    print(f'\nSeed {seed_index} of {num_seeds} done.\n')

As before, we average performances across seeds and visualize the results

res_df = (log_df.groupby(['Approach', 'Noise (%)']))['F1 score'].agg(F1_score_mean='mean', F1_score_std='std').reset_index()
fig = px.line(res_df, x="Noise (%)", y="F1_score_mean", color="Approach",
             color_discrete_map={'Without Data Quality Check': 'orange',
                                 'With Data Quality Checks': 'blue',
                                 'With DQC Toolkit': 'green'
                                },
             labels={"F1_score_mean": "F1 score (Seed Averaged)"},
             category_orders={'Approach': ['Without Data Quality Check', 'With Data Quality Checks', 'With DQC Toolkit']})
fig.show()

Performance comparison using DQC Toolkit

DQC Toolkit outperforms 'Without DQC' for noise levels above 10%. This could be attributed to additional optimizations implemented in the library. It currently supports text classification (binary/multi-class) problems with various parameter customization options; check out the documentation for details. The plan is to enhance it further by adding more capabilities based on your feedback. Following is the link to the repo -

GitHub: sumanthprabhu / DQC-Toolkit

Data quality checks to curate noisy labels in the data

DQC Toolkit is a Python library and framework designed to facilitate improving Machine Learning models by identifying and mitigating label errors in the training dataset. Currently, DQC Toolkit offers CrossValCurate for curation of text classification datasets (binary / multi-class) using cross-validation based selection.

Installation

DQC Toolkit can be installed as shown below

pip install dqc-toolkit

Quick Start

Assuming your text classification data is stored as a pandas dataframe data, with each sample represented by the text column and its corresponding noisy label represented by the label column, here is how you use CrossValCurate -

from dqc import CrossValCurate

cvc = CrossValCurate()
data_curated = cvc.fit_transform(data[['text', 'label']])

The result is stored in data_curated, a pandas DataFrame similar to data with additional quality check columns such as 'label_correctness_score', 'predicted_label', 'prediction_probability' and 'is_label_correct', analogous to the output of the quality_checker function we built earlier.

Thank you for reading

Passionate about Machine Learning? Please feel free to add me on LinkedIn.
