Cleaning Up Confluence Chaos: A Python and BERTopic Quest



A tale of taming unruly documents to create the ultimate GPT-based chatbot

Photo by Rick Mason on Unsplash

Introduction

Picture this: you’re at a rapidly growing tech company, and you’ve been given the mission to create a state-of-the-art chatbot using the mind-blowing GPT technology. This chatbot is destined to become the company’s crown jewel, a virtual oracle that’ll answer questions based on the treasure trove of knowledge stored in your Confluence spaces. Sounds like a dream job, right?

But, as you take a closer look at the Confluence knowledge base, reality hits. It’s a wild jungle of empty/incomplete pages, irrelevant documents and duplicate content. It’s like someone dumped a thousand jigsaw puzzles into a giant blender and pressed “start.” And now, it’s your job to clean up this mess before you can even think about building that amazing chatbot.

Luckily for you, in this article, we’ll embark on a thrilling journey to conquer the Confluence chaos, using the power of Python and BERTopic to identify and eliminate those annoying outliers. So, buckle up and get ready to transform your knowledge base into the perfect training ground for your cutting-edge GPT-based chatbot.

The Manual Approach and the Heuristic Temptation

As you face the daunting task of cleaning up your Confluence knowledge base, you might consider diving in manually, sorting through each document one by one. However, the manual approach is slow, labor-intensive, and error-prone. After all, even the most meticulous employee can overlook important details or misjudge the relevance of a document.

With your knowledge of Python, you might be tempted to create a heuristic-based solution, using a set of predefined rules to identify and eliminate outliers. While this approach is faster than manual cleanup, it has its limitations. Heuristics can be rigid and struggle to adapt to the complex and ever-evolving nature of your Confluence spaces, often leading to suboptimal results.
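To make the heuristic temptation concrete, a naive first pass might look like the sketch below. The word-count threshold and the title keywords are arbitrary illustrative assumptions, not values from this project, and this is exactly the kind of rigid rule that struggles to keep up with a living knowledge base.

# A naive heuristic sketch (illustrative only): flag pages that are very short
# or whose title suggests they are drafts or leftovers. The threshold and the
# keyword list below are arbitrary assumptions, not part of the actual pipeline.
SUSPECT_TITLE_KEYWORDS = ("draft", "old", "archive", "test")

def looks_like_noise(title: str, text: str) -> bool:
    too_short = len(text.split()) < 50
    suspicious_title = any(kw in title.lower() for kw in SUSPECT_TITLE_KEYWORDS)
    return too_short or suspicious_title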

Python and BERTopic — The Powerful Duo for Confluence Cleanup

Enter Python and BERTopic, a powerful combination that can help you tackle the challenge of cleaning up your Confluence knowledge base more effectively. Python is a versatile programming language, while BERTopic is an advanced topic modeling library that can analyze your documents and group them based on their underlying topics.

In the next paragraphs, we’ll explore how Python and BERTopic can work together to automate the process of identifying and eliminating outliers in your Confluence spaces. By harnessing their combined powers, you’ll save time and resources while increasing the accuracy and effectiveness of your cleanup efforts.

The Python-BERTopic Project — A Step-by-Step Guide

Alright, from this point on, I’ll walk you through the process of creating a Python script that uses BERTopic to identify and eliminate outliers in your Confluence knowledge base. The goal is to generate a ranked list of documents based on their “unrelatedness” score (which we’ll define later). For each document, the output will include its title, a preview of the text (first 100 characters), and its unrelatedness score, and will look like this:

(Title: “AI in Healthcare”, Preview: “Artificial intelligence is transforming…”, Unrelatedness: 0.95)
(Title: “Office Birthday Party Guidelines”, Preview: “To ensure a fun and safe…”, Unrelatedness: 0.8)
The essential steps in this process include:

  • Connect to Confluence and download documents: establish a connection to your Confluence account and fetch the documents for processing. This section provides guidance on setting up the connection, authenticating, and downloading the necessary data.
  • HTML processing and text extraction using Beautiful Soup: use Beautiful Soup, a powerful Python library, to manage HTML content and extract the text from Confluence documents. This step involves cleaning up the extracted text, removing unwanted elements, and preparing the data for analysis.
  • Apply BERTopic and create the ranking: with the cleaned-up text in hand, apply BERTopic to analyze and group the documents based on their underlying topics. After obtaining the topic representations, calculate the “unrelatedness” measure for each document and create a ranking to identify and eliminate outliers in your Confluence knowledge base.

Confluence Connection and Documents Download

Finally, the code. We’ll start by downloading documents from a Confluence space, then process the HTML content and extract the text for the next phase (BERTopic!).

First, we need to connect to Confluence via its API. Thanks to the atlassian-python-api library, this can be done in a few lines of code. If you don’t have an API token for Atlassian, read this guide to set one up.

import os
import re
from atlassian import Confluence
from bs4 import BeautifulSoup

# Set up the Confluence API client
confluence = Confluence(
    url='YOUR_CONFLUENCE_URL',
    username="YOUR_EMAIL",
    password="YOUR_API_KEY",
    cloud=True)

# Replace SPACE_KEY with the desired Confluence space key
space_key = 'YOUR_SPACE'

def get_all_pages_from_space_with_pagination(space_key):
    # Fetch pages in batches of 50 until the space is exhausted
    limit = 50
    start = 0
    all_pages = []

    while True:
        pages = confluence.get_all_pages_from_space(space_key, start=start, limit=limit)
        if not pages:
            break

        all_pages.extend(pages)
        start += limit

    return all_pages


pages = get_all_pages_from_space_with_pagination(space_key)

After fetching the pages, we’ll create a directory for the text files, extract the pages’ content and save the text content to individual files:

# Function to sanitize filenames
def sanitize_filename(filename):
    return "".join(c for c in filename if c.isalnum() or c in (' ', '.', '-', '_')).rstrip()

# Create a directory for the text files if it doesn't exist
if not os.path.exists('txt_files'):
    os.makedirs('txt_files')

# Extract pages and save to individual text files
for page in pages:
    page_id = page['id']
    page_title = page['title']

    # Fetch the page content
    page_content = confluence.get_page_by_id(page_id, expand='body.storage')

    # Extract the content in the "storage" format
    storage_value = page_content['body']['storage']['value']

    # Clean the HTML tags to get the text content
    text_content = process_html_document(storage_value)
    file_name = f'txt_files/{sanitize_filename(page_title)}_{page_id}.txt'
    with open(file_name, 'w', encoding='utf-8') as txtfile:
        txtfile.write(text_content)

The function process_html_document carries out all the necessary cleaning tasks to extract the text from the downloaded pages while maintaining a coherent format. The extent to which you want to refine this process depends on your specific requirements. In this case, we focus on handling tables and lists to ensure that the resulting text document retains a format similar to the original layout.

import spacy

nlp = spacy.load("en_core_web_sm")

def html_table_to_text(html_table):
    soup = BeautifulSoup(html_table, "html.parser")

    # Extract table rows
    rows = soup.find_all("tr")

    # Determine if the table has headers or not
    has_headers = any(th for th in soup.find_all("th"))

    # Extract table headers, either from the <th> elements or from the first row
    if has_headers:
        headers = [th.get_text(strip=True) for th in soup.find_all("th")]
        row_start_index = 1  # Skip the first row, as it contains headers
    else:
        first_row = rows[0]
        headers = [cell.get_text(strip=True) for cell in first_row.find_all("td")]
        row_start_index = 1

    # Iterate through rows and cells, and use NLP to generate sentences
    text_rows = []
    for row in rows[row_start_index:]:
        cells = row.find_all("td")
        cell_sentences = []
        for header, cell in zip(headers, cells):
            # Generate a sentence using the header and cell value
            doc = nlp(f"{header}: {cell.get_text(strip=True)}")
            sentence = " ".join([token.text for token in doc if not token.is_stop])
            cell_sentences.append(sentence)

        # Combine cell sentences into a single row text
        row_text = ", ".join(cell_sentences)
        text_rows.append(row_text)

    # Combine row texts into a single text
    text = "\n\n".join(text_rows)
    return text

def html_list_to_text(html_list):
    soup = BeautifulSoup(html_list, "html.parser")
    items = soup.find_all("li")
    text_items = []
    for item in items:
        item_text = item.get_text(strip=True)
        text_items.append(f"- {item_text}")
    text = "\n".join(text_items)
    return text

def process_html_document(html_document):
    soup = BeautifulSoup(html_document, "html.parser")

    # Replace tables with text using html_table_to_text
    for table in soup.find_all("table"):
        table_text = html_table_to_text(str(table))
        table.replace_with(BeautifulSoup(table_text, "html.parser"))

    # Replace lists with text using html_list_to_text
    for ul in soup.find_all("ul"):
        ul_text = html_list_to_text(str(ul))
        ul.replace_with(BeautifulSoup(ul_text, "html.parser"))

    for ol in soup.find_all("ol"):
        ol_text = html_list_to_text(str(ol))
        ol.replace_with(BeautifulSoup(ol_text, "html.parser"))

    # Replace all types of <br> with newlines
    br_tags = re.compile('<br>|<br/>|<br />')
    html_with_newlines = br_tags.sub('\n', str(soup))

    # Strip remaining HTML tags to isolate the text
    soup_with_newlines = BeautifulSoup(html_with_newlines, "html.parser")

    return soup_with_newlines.get_text()
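To see how this behaves, here is a quick sanity check on a small, made-up HTML snippet (assuming the helper functions above and the spaCy model are already loaded in the same session); the printed output is approximate:

# Quick sanity check of process_html_document on an invented snippet
sample_html = """
<h1>Team Onboarding</h1>
<p>Welcome to the team!<br/>Please review the checklist below.</p>
<ul><li>Request a laptop</li><li>Join the #general channel</li></ul>
"""

print(process_html_document(sample_html))
# Prints something along the lines of:
#
# Team Onboarding
# Welcome to the team!
# Please review the checklist below.
# - Request a laptop
# - Join the #general channel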

Identifying Outliers with BERTopic

In this final section, we’ll leverage BERTopic, a powerful topic modeling technique that builds on BERT embeddings. You can learn more about BERTopic in its GitHub repository and documentation.

Our approach to finding outliers consists of running BERTopic several times with different values for the number of topics, and combining the results into an “unrelatedness” score with two components:

  • Frequency: in each iteration, we collect all documents that fall into the outlier cluster (-1). The more often a document lands in that cluster, the more likely it is to be an outlier.
  • Probability: BERTopic also provides a probability value for documents in the outlier cluster. We average these probabilities for each document over all iterations.

Finally, we determine the overall unrelatedness score for each document by averaging the two components (normalized frequency and average probability). This combined score helps us identify the most unrelated documents in the dataset.
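To make the scoring concrete, here is a tiny worked example with invented numbers (the counts and probabilities below are hypothetical, purely to show how the two components are combined by the code later in this section):

# Invented numbers, purely to illustrate how the two components combine
outlier_hits = 5            # times this document fell into the outlier cluster across runs
min_hits, max_hits = 0, 5   # least / most flagged document in the corpus
avg_outlier_prob = 0.9      # mean outlier probability over this document's hits

normalized_count = (outlier_hits - min_hits) / (max_hits - min_hits)  # -> 1.0
unrelatedness = (normalized_count + avg_outlier_prob) / 2             # -> 0.95
print(f"Unrelatedness: {unrelatedness:.2f}")                          # -> Unrelatedness: 0.95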

Here is the initial code:

import numpy as np
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import MaximalMarginalRelevance
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english")
representation_model = MaximalMarginalRelevance(diversity=0.2)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Collect text and filenames from the documents in the txt_files directory
documents = []
filenames = []

for file in os.listdir('txt_files'):
    if file.endswith('.txt'):
        with open(os.path.join('txt_files', file), 'r', encoding='utf-8') as f:
            documents.append(f.read())
            filenames.append(file)

In this code block, we set up the necessary tools for BERTopic by importing the required libraries and initializing the models. We define three models that will be used by BERTopic:

  • vectorizer_model: the CountVectorizer model tokenizes the documents and creates a document-term matrix where each entry represents the count of a term in a document. It also removes English stop words from the documents to improve topic modeling performance.
  • representation_model: the MaximalMarginalRelevance (MMR) model diversifies the extracted topics by considering both the relevance and diversity of topics. The diversity parameter controls the trade-off between these two aspects, with higher values leading to more diverse topics.
  • ctfidf_model: the ClassTfidfTransformer model adjusts the term frequency-inverse document frequency (TF-IDF) scores of the document-term matrix to better represent topics. It reduces the impact of frequently occurring words across topics and enhances the distinction between topics.

We then collect the text and filenames of the documents from the ‘txt_files’ directory to process them with BERTopic in the next step.

def extract_topics(docs, n_topics):
    model = BERTopic(nr_topics=n_topics, calculate_probabilities=True, language="english",
                     ctfidf_model=ctfidf_model, representation_model=representation_model,
                     vectorizer_model=vectorizer_model)
    topics, probabilities = model.fit_transform(docs)
    return model, topics, probabilities

def find_outlier_topic(model):
    topic_sizes = model.get_topic_freq()
    outlier_topic = topic_sizes.iloc[-1]["Topic"]
    return outlier_topic

outlier_counts = np.zeros(len(documents))
outlier_probs = np.zeros(len(documents))

# Define the range of topic counts you want to try
min_topics = 5
max_topics = 10

for n_topics in range(min_topics, max_topics + 1):
    model, topics, probabilities = extract_topics(documents, n_topics)
    outlier_topic = find_outlier_topic(model)

    for i, (topic, prob) in enumerate(zip(topics, probabilities)):
        if topic == outlier_topic:
            outlier_counts[i] += 1
            outlier_probs[i] += prob[outlier_topic]

In the above section, we use BERTopic to identify outlier documents by iterating through a range of topic counts from a specified minimum to a maximum. For each topic count, BERTopic extracts the topics and their corresponding probabilities. It then identifies the outlier topic and updates the outlier_counts and outlier_probs for documents assigned to this outlier topic. This process iteratively accumulates counts and probabilities, providing a measure of how often and how ‘strongly’ documents are classified as outliers.

Finally, we can compute our unrelatedness score and print the results:

def normalize(arr):
    min_val, max_val = np.min(arr), np.max(arr)
    return (arr - min_val) / (max_val - min_val)

# Average the probabilities over the number of times each document was flagged
avg_outlier_probs = np.divide(outlier_probs, outlier_counts, out=np.zeros_like(outlier_probs), where=outlier_counts != 0)

# Normalize counts
normalized_counts = normalize(outlier_counts)

# Compute the combined unrelatedness score by averaging the normalized counts and probabilities
unrelatedness_scores = [(i, (count + prob) / 2) for i, (count, prob) in enumerate(zip(normalized_counts, avg_outlier_probs))]
unrelatedness_scores.sort(key=lambda x: x[1], reverse=True)

# Print the filtered results
for index, score in unrelatedness_scores:
    if score > 0:
        title = filenames[index]
        preview = documents[index][:100] + "..." if len(documents[index]) > 100 else documents[index]
        print(f"Title: {title}, Preview: {preview}, Unrelatedness: {score:.2f}")
        print("\n")

And that’s it! You now have your list of outlier documents ranked by unrelatedness. By cleaning up your Confluence spaces and removing irrelevant content, you can pave the way for a more efficient and valuable chatbot that leverages your organization’s knowledge. Happy cleaning!

Did you enjoy this article? Want to stay updated on future content like this? Don’t forget to follow me on Medium to get notified about my latest articles and insights in AI, machine learning, and more. Let’s continue our learning journey together!


Cleaning Up Confluence Chaos: A Python and BERTopic Quest was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
