Building a Headless URL Checker for Large-Scale Web Data Screening


Source: dev.to

When dealing with massive datasets of URLs, especially in fields like sports analytics, finance, or research, it’s not uncommon to hit a frustrating pattern: a web page loads and the server reports success, but the page contains only a polite “Sorry, we couldn’t find what you’re looking for” message. These false positives are time-consuming to verify manually, especially when the dataset spans tens of thousands of links.

To solve this problem, I built a headless URL checker that scans web pages for meaningful content, distinguishing between truly valuable pages and placeholder or “no data” pages. This post outlines how the tool works, the features that make it cloud-ready and scalable, and how it handles the challenges of bot detection.

The Problem: Valid URLs, Useless Content

The URLs I worked with pointed to profile pages—some of which loaded correctly but returned a message like:

"The Graded Stakes Profile you were searching for could not be found"

These pages technically loaded (i.e., they returned HTTP status code 200), but they didn’t contain the content I needed. I needed a way to identify these cases automatically so I could focus only on URLs with actual data.

The Solution: Headless URL Checker

The solution is a Python-based script (url_checker_HEADLESS.py) that uses Selenium with headless Chrome to programmatically load and inspect pages.

Main Capabilities:

  • Reads from a CSV: Input is a master file (Equibase_URLS.csv) containing ~30,000 URLs.
  • Searches page content: Scans the page for a specific error message indicating no relevant data.
  • Saves results: Records either "yes" (content found) or "no" (error message detected) in an output CSV.

Here’s the logic in a nutshell:

  • If the page contains the "could not be found" message → record "no".
  • If the message is absent → record "yes", indicating this URL contains data we are interested in.
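
The yes/no rule above can be sketched as a small function. This is a minimal illustration, not the code from the repo: the names NOT_FOUND_MARKER, classify_page, and check_url are hypothetical, and the case-insensitive substring match is an assumption. check_url expects a Selenium-style driver and relies only on its get() method and page_source attribute.

```python
# Hypothetical sketch of the yes/no rule; names are illustrative, not from the repo.
NOT_FOUND_MARKER = "could not be found"  # phrase quoted in the post


def classify_page(page_text: str, marker: str = NOT_FOUND_MARKER) -> str:
    """Return "no" if the error message is present, otherwise "yes"."""
    # Assumption: a case-insensitive substring match is good enough.
    return "no" if marker.lower() in page_text.lower() else "yes"


def check_url(driver, url: str) -> str:
    """Load a URL with a Selenium-style driver and classify the rendered page."""
    driver.get(url)  # navigate the browser to the URL
    return classify_page(driver.page_source)  # page_source is the current HTML
```

One consequence of a rule like this: a page that fails to render, or that shows a CAPTCHA instead of the profile, also lacks the error message and would be labeled "yes" unless it is handled separately.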


Built for the Cloud (and for Stealth)

Scraping at scale comes with risks: IP blocking, CAPTCHAs, and bot protection mechanisms. This checker includes several countermeasures:

  • Headless Chrome Browsing: Runs without a visible browser window, which suits cloud-based execution.
  • Randomised User Agents: Makes requests harder to fingerprint by User-Agent string alone.
  • Session Cookie Injection: Uses a cookies.pkl file generated from a real browser session via generate_cookies.py.
  • Random Delays: Introduces human-like delays between URL batches.
  • Manual CAPTCHA Mode: Supports a manual fallback if a CAPTCHA appears during execution.
  • Batch Processing: Splits the workload across smaller CSV chunks to avoid memory or session issues.
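
A rough sketch of how several of these countermeasures might fit together is shown below. It is not taken from the repo: the function names, the user-agent pool, the delay bounds, and the base_url parameter are all assumptions for illustration. It also assumes cookies.pkl holds a pickled list of cookie dictionaries in the shape Selenium's get_cookies() returns.

```python
import pickle
import random
import time
from typing import Iterator, List, Sequence

# Assumption: a small illustrative pool; the post does not show the real list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]


def batches(urls: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


def random_delay(low: float = 2.0, high: float = 6.0) -> float:
    """Sleep for a random interval (hypothetical bounds) and return its length."""
    seconds = random.uniform(low, high)
    time.sleep(seconds)
    return seconds


def build_driver(cookie_file: str = "cookies.pkl",
                 base_url: str = "https://example.com"):
    """Create a headless Chrome driver with a random user agent and saved cookies."""
    # Imported lazily so the helpers above work without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    options.add_argument(f"--user-agent={random.choice(USER_AGENTS)}")
    driver = webdriver.Chrome(options=options)

    # Selenium only accepts cookies for the domain currently loaded,
    # so visit the site once before injecting them.
    driver.get(base_url)
    with open(cookie_file, "rb") as fh:
        for cookie in pickle.load(fh):
            driver.add_cookie(cookie)
    return driver
```

In a loop over batches(urls, 500), calling random_delay() after each chunk would give the pause between batches described above; the batch size of 500 is an arbitrary example.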

Output: Simple, Clear, and Actionable

Each row in the output CSV includes:

  • The original URL.
  • A "yes" or "no" label indicating whether the page is worth further inspection.
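
For illustration, reading the input file and writing the two-column output could look like the sketch below. The layout is an assumption: URLs in the first column with no header row on input, and hypothetical column names "url" and "has_data" on output. The post does not specify either.

```python
import csv
from typing import Iterable, List, Tuple


def read_urls(path: str) -> List[str]:
    """Read URLs from the first column of a CSV file, skipping blank rows."""
    with open(path, newline="", encoding="utf-8") as fh:
        return [row[0].strip() for row in csv.reader(fh) if row and row[0].strip()]


def write_results(path: str, results: Iterable[Tuple[str, str]]) -> None:
    """Write (url, label) pairs to a CSV file under a header row."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["url", "has_data"])  # hypothetical column names
        writer.writerows(results)
```

Filtering the output for rows labeled "yes" then gives the list of URLs worth a closer look.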

Final Thoughts

This tool has already saved me hours of manual inspection and allowed me to filter out tens of thousands of dead-end URLs with high precision. Whether you’re dealing with scraped web data or any large volume of dynamic content, having a smart, headless filter like this in your toolkit can make your workflow significantly more efficient and a lot less painful.

The full repo can be found here: https://github.com/TheOxfordDeveloper/URL-checker.git
