

📚 Web Extraction with Vision-LLMs Done the Right Way: Structured Data From Any URL with GPT-4o


💡 News category: Programming
🔗 Source: dev.to

Let's talk about GPT-4o

GPT-4o, OpenAI's latest vision-language model, handles images far better than its predecessor, GPT-4. This improved multimodal capability makes it particularly useful for processing visually complex web data that traditional scrapers often struggle with. Whether you're extracting information from blogs, live feeds, news articles, or YouTube videos, vision-language models provide a significant advantage over standard language models for unstructured extraction "in the wild".

In this guide, we'll explore how to scrape visual and text content from webpages and prepare the data for use with multimodal language models like GPT-4o. Then, we'll use the scraped visuals and text as a prompt to extract specific structured data from the articles on the webpage. Lastly, we'll show how to validate the data and upsert it into a PostgreSQL database.

Set Up Your APIs

Ensure you have set the THEPIPE_API_KEY environment variable with your API key. If you don't have an API key, you can get one here or you can use The Pipe on your own server by following the local setup instructions in the documentation. Additionally, set your OPENAI_API_KEY for accessing GPT-4o. Don't have that either? Get it here.

For Windows users:

setx THEPIPE_API_KEY "your_api_key"
setx OPENAI_API_KEY "your_openai_api_key"

For Mac users:

export THEPIPE_API_KEY="your_api_key"
export OPENAI_API_KEY="your_openai_api_key"

On Windows, restart your terminal for the setx changes to take effect; on Mac, the export commands apply to the current shell session (add them to your shell profile to make them permanent). Next, install The Pipe API by opening your terminal and running the following command:

pip install thepipe_api

Extract Content from a Webpage

Use The Pipe API to extract text and images from a webpage. Here's an example of how to do this:

from thepipe_api import thepipe

# Extract multimodal content from a webpage
webpage_content = thepipe.extract("https://www.bbc.com/")

The Pipe API can handle dynamic content that loads as you scroll (as many modern webpages do), automatically scrolling the page to ensure that all relevant text and images are captured. This is particularly important for visual web data, which traditional scrapers often miss or misinterpret.

The result of the extraction will be a list of dictionaries, each containing the extracted content from the webpage as a hosted browser scrolls through the page. The content will include both text and images, making it suitable for use with multimodal language models like GPT-4o:

[Image: web extraction output]
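Because the extracted chunks are later passed straight to the OpenAI chat completions API, they appear to follow the OpenAI message format. As a quick sanity check, here is a minimal sketch for inspecting what came back; the exact field names are an assumption, so verify them against your own output:

for message in webpage_content:
    content = message.get("content", [])
    if isinstance(content, str):
        # Some chunks may carry plain text directly
        print("Text chunk:", content[:100])
        continue
    for part in content:
        if part.get("type") == "text":
            print("Text chunk:", part["text"][:100])
        elif part.get("type") == "image_url":
            print("Image chunk:", str(part["image_url"])[:100])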

Prepare the Input for GPT-4o

Next, prepare the input prompt by combining the extracted content with a user query. This will help GPT-4o understand what you want to achieve with the extracted data. In our case, we will perform a structured data extraction task from the webpage, looking to grab the articles, their contents, and any images associated with them.

# Add a user query
query = [{
    "role": "system", 
    "content": [{
            "type": "text", 
            "text": """Please extract each article from the given webpage. Do this by returning a JSON object with the key, "articles", containing a list of all the articles in the given page. For each article, provide a JSON dictionary containing the following keys: 
            title (required string),
            extracted_plaintext (required string),
            topics (required list),
            sentiment (required string),
            language (required string),
            image_description (optional string)."""
    }]
}]

# Combine the content to create the input prompt for GPT-4o
messages = webpage_content + query

Send the Input to GPT-4o

With the input prepared, you can now send it to GPT-4o using the OpenAI API. Make sure you have your OPENAI_API_KEY set in your environment variables.

from openai import OpenAI
import json

# Initialize the OpenAI client
openai_client = OpenAI()

# Send the unstructured visuals and text to GPT-4o
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    response_format={"type": "json_object"},
    temperature=0,
)
# Extract the structured JSON
response = response.choices[0].message.content
response = json.loads(response)
# Print the result
print(response)

GPT-4o will process the input prompt and return a structured JSON object containing the extracted articles, their titles, extracted plaintext, topics, sentiment, language, and image descriptions (if available). This structured data can be used for further analysis or processing (see images below for visualizations using the Beta version of The Pipe API portal).

[Image: GPT-4o output]

[Image: GPT-4o JSON]
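The introduction also promised a validation step before storing the data. The original walkthrough doesn't show one, but here is a minimal sketch using Pydantic (an assumption, not part of the original tooling) that mirrors the schema requested in the prompt:

from typing import List, Optional
from pydantic import BaseModel, ValidationError

# Schema matching the keys requested in the extraction prompt
class Article(BaseModel):
    title: str
    extracted_plaintext: str
    topics: List[str]
    sentiment: str
    language: str
    image_description: Optional[str] = None

class ExtractionResult(BaseModel):
    articles: List[Article]

try:
    validated = ExtractionResult(**response)
    print(f"Validated {len(validated.articles)} articles")
except ValidationError as e:
    print("GPT-4o output did not match the expected schema:", e)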

Putting the Data to Use

Now that you have the structured data, you can use it for various purposes, such as content analysis, summarization, sentiment analysis, or even generating new content based on the extracted information. The structured data can be easily pushed into a SQL table, a NoSQL database, or any other data storage system for further processing. For example, here, I am pushing the extracted data into a PostgreSQL database hosted on Supabase:

import os
import supabase

# Connect using the project URL and key (assumed to be set as environment variables)
sb_client = supabase.create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

for article in response['articles']:
    try:
        entry = {
            'title': article['title'],
            'extracted_plaintext': article['extracted_plaintext'],
            'topics': article['topics'],
            'sentiment': article['sentiment'],
            'language': article['language'],
            'image_description': article.get('image_description', None)
        }
        sb_client.table('demo_db').insert(entry).execute()
    except Exception as e:
        print(f"Failed to insert article: {e}")
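The introduction mentions upserting rather than inserting; supabase-py also exposes an upsert method, which avoids duplicate rows on repeated runs. A one-line sketch, assuming the demo_db table has a unique constraint on the title column:

sb_client.table('demo_db').upsert(entry, on_conflict='title').execute()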

[Image: Supabase]

Handling Token Limits

When working with vision models, you may hit token limits sooner than with text-only models, since images consume a large number of tokens. The Pipe API lets you extract text-only content in such cases to avoid exceeding the limit. For more details, see the discussion on token limits for GPT-4-Vision.

# Extract text-only content from a webpage
webpage_content_text_only = thepipe.extract("https://example.com", text_only=True)
messages_text_only = webpage_content_text_only + query
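To get a rough sense of how large the text portion of a prompt is before sending it, you can count tokens with tiktoken (a sketch, assuming a recent tiktoken version that ships the o200k_base encoding used by GPT-4o; image inputs are billed separately and are not covered here):

import tiktoken

encoding = tiktoken.get_encoding("o200k_base")

total_tokens = 0
for message in messages_text_only:
    content = message.get("content", "")
    # Content may be a plain string or a list of typed parts
    parts = content if isinstance(content, list) else [{"type": "text", "text": content}]
    for part in parts:
        if part.get("type") == "text":
            total_tokens += len(encoding.encode(part["text"]))

print(f"Approximate text token count: {total_tokens}")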

Congratulations!

You've successfully scraped a webpage and extracted visually complicated unstructured data from it. This process can be extended to other types of content, such as PDFs, videos, and more, using The Pipe API. For more details on GPT-4o, check out the OpenAI announcement. If you're a developer, feel free to contribute to The Pipe on GitHub!

Happy coding! 🚀
