

📚 Web Extraction with Vision-LLMs Done the Right Way: Structured Data From Any URL with GPT-4o


💡 News category: Programming
🔗 Source: dev.to

Let's talk about GPT-4o

GPT-4o, OpenAI's latest vision-language model, handles images far better than its predecessor, GPT-4. This improved multimodal capability makes it particularly useful for processing visually complex web data that traditional scrapers often struggle with. Whether you're extracting information from blogs, live feeds, news articles, or YouTube videos, vision-language models provide a significant advantage over standard language models for unstructured extraction "in the wild".

In this guide, we'll explore how to scrape visual and text content from webpages and prepare the data for use with multimodal language models like GPT-4o. Then, we'll use the scraped visuals and text as a prompt to extract specific structured data from the articles on the webpage. Lastly, we'll show how to validate the data and upsert it into a PostgreSQL database.

Set Up Your APIs

Ensure you have set the THEPIPE_API_KEY environment variable with your API key. If you don't have an API key, you can get one here or you can use The Pipe on your own server by following the local setup instructions in the documentation. Additionally, set your OPENAI_API_KEY for accessing GPT-4o. Don't have that either? Get it here.

For Windows users:

setx THEPIPE_API_KEY "your_api_key"
setx OPENAI_API_KEY "your_openai_api_key"

For Mac users:

export THEPIPE_API_KEY="your_api_key"
export OPENAI_API_KEY="your_openai_api_key"

On Windows, restart your terminal for the setx changes to take effect; on Mac, the export commands apply to the current shell session (add them to your shell profile to make them permanent). Next, install The Pipe API by opening your terminal and running the following command:

pip install thepipe_api

Extract Content from a Webpage

Use The Pipe API to extract text and images from a webpage. Here's an example of how to do this:

from thepipe_api import thepipe

# Extract multimodal content from a webpage
webpage_content = thepipe.extract("https://www.bbc.com/")

The Pipe API can handle dynamic content that loads as you scroll (as many modern webpages do), automatically scrolling the page to ensure that all relevant text and images are captured. This is particularly important for visual web data, which traditional scrapers often miss or misinterpret.

The result of the extraction will be a list of dictionaries, each containing the extracted content from the webpage as a hosted browser scrolls through the page. The content will include both text and images, making it suitable for use with multimodal language models like GPT-4o:

[Image: web extraction output]
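Because the extracted chunks are later passed straight to the OpenAI chat completions API, they appear to follow the OpenAI message format. As a quick sanity check, here is a minimal sketch for inspecting what came back; the exact field names are an assumption, so verify them against your own output:

for message in webpage_content:
    content = message.get("content", [])
    if isinstance(content, str):
        # Some chunks may carry plain text directly
        print("Text chunk:", content[:100])
        continue
    for part in content:
        if part.get("type") == "text":
            print("Text chunk:", part["text"][:100])
        elif part.get("type") == "image_url":
            print("Image chunk:", str(part["image_url"])[:100])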

Prepare the Input for GPT-4o

Next, prepare the input prompt by combining the extracted content with a user query. This will help GPT-4o understand what you want to achieve with the extracted data. In our case, we will perform a structured data extraction task from the webpage, looking to grab the articles, their contents, and any images associated with them.

# Add a user query
query = [{
    "role": "system", 
    "content": [{
            "type": "text", 
            "text": """Please extract each article from the given webpage. Do this by returning a JSON object with the key, "articles", containing a list of all the articles in the given page. For each article, provide a JSON dictionary containing the following keys: 
            title (required string),
            extracted_plaintext (required string),
            topics (required list),
            sentiment (required string),
            language (required string),
            image_description (optional string)."""
    }]
}]

# Combine the content to create the input prompt for GPT-4o
messages = webpage_content + query

Send the Input to GPT-4o

With the input prepared, you can now send it to GPT-4o using the OpenAI API. Make sure you have your OPENAI_API_KEY set in your environment variables.

from openai import OpenAI
import json

# Initialize the OpenAI client
openai_client = OpenAI()

# Send the unstructured visuals and text to GPT-4o
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    response_format={"type": "json_object"},
    temperature=0,
)
# Extract the structured JSON
response = response.choices[0].message.content
response = json.loads(response)
# Print the result
print(response)

GPT-4o will process the input prompt and return a structured JSON object containing the extracted articles, their titles, extracted plaintext, topics, sentiment, language, and image descriptions (if available). This structured data can be used for further analysis or processing (see images below for visualizations using the Beta version of The Pipe API portal).

[Image: GPT-4o output]

[Image: GPT-4o JSON]
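The introduction also promised a validation step before storing the data. The original walkthrough doesn't show one, but here is a minimal sketch using Pydantic (an assumption, not part of the original tooling) that mirrors the schema requested in the prompt:

from typing import List, Optional
from pydantic import BaseModel, ValidationError

# Schema matching the keys requested in the extraction prompt
class Article(BaseModel):
    title: str
    extracted_plaintext: str
    topics: List[str]
    sentiment: str
    language: str
    image_description: Optional[str] = None

class ExtractionResult(BaseModel):
    articles: List[Article]

try:
    validated = ExtractionResult(**response)
    print(f"Validated {len(validated.articles)} articles")
except ValidationError as e:
    print("GPT-4o output did not match the expected schema:", e)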

Putting the Data to Use

Now that you have the structured data, you can use it for various purposes, such as content analysis, summarization, sentiment analysis, or even generating new content based on the extracted information. The structured data can be easily pushed into a SQL table, a NoSQL database, or any other data storage system for further processing. For example, here, I am pushing the extracted data into a PostgreSQL database hosted on Supabase:

import os
import supabase

# Connect using the project URL and key (assumed to be set as environment variables)
sb_client = supabase.create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

for article in response['articles']:
    try:
        entry = {
            'title': article['title'],
            'extracted_plaintext': article['extracted_plaintext'],
            'topics': article['topics'],
            'sentiment': article['sentiment'],
            'language': article['language'],
            'image_description': article.get('image_description', None)
        }
        sb_client.table('demo_db').insert(entry).execute()
    except Exception as e:
        print(f"Failed to insert article: {e}")
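The introduction mentions upserting rather than inserting; supabase-py also exposes an upsert method, which avoids duplicate rows on repeated runs. A one-line sketch, assuming the demo_db table has a unique constraint on the title column:

sb_client.table('demo_db').upsert(entry, on_conflict='title').execute()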

[Image: Supabase]

Handling Token Limits

When working with vision models, you may hit token limits sooner than with text-only models, since images consume a large number of tokens. The Pipe API lets you extract text-only content in such cases to avoid exceeding the limit. For more details, see the discussion on token limits for GPT-4-Vision.

# Extract text-only content from a webpage
webpage_content_text_only = thepipe.extract("https://example.com", text_only=True)
messages_text_only = webpage_content_text_only + query
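To get a rough sense of how large the text portion of a prompt is before sending it, you can count tokens with tiktoken (a sketch, assuming a recent tiktoken version that ships the o200k_base encoding used by GPT-4o; image inputs are billed separately and are not covered here):

import tiktoken

encoding = tiktoken.get_encoding("o200k_base")

total_tokens = 0
for message in messages_text_only:
    content = message.get("content", "")
    # Content may be a plain string or a list of typed parts
    parts = content if isinstance(content, list) else [{"type": "text", "text": content}]
    for part in parts:
        if part.get("type") == "text":
            total_tokens += len(encoding.encode(part["text"]))

print(f"Approximate text token count: {total_tokens}")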

Congratulations!

You've successfully scraped a webpage and extracted visually complicated unstructured data from it. This process can be extended to other types of content, such as PDFs, videos, and more, using The Pipe API. For more details on GPT-4o, check out the OpenAI announcement. If you're a developer, feel free to contribute to The Pipe on GitHub!

Happy coding! 🚀
