The Problem: Media Cleanup Gone Wrong
You’ve just finished cleaning up your WordPress media library – removing thousands of unused images, converting formats (JPG to AVIF, PNG to WebP), and optimizing file sizes. Everything looks great until you start finding broken images scattered across your site.
Sound familiar? This exact scenario happened with a recent client. After an aggressive media cleanup using WordPress plugins, we discovered broken images hiding in places the cleanup tools couldn’t detect:
- Images embedded in Gravity Forms HTML fields
- Background images in custom CSS
- Images in old blog posts and custom fields
- Generated image sizes that got deleted
The challenge: Most broken link checkers focus on URLs, not images. And manually checking hundreds of pages for broken images is impractical.
The solution: A Python script that automatically scans your entire website for broken images using your XML sitemap.
Why Existing Tools Fall Short
WordPress media cleaner plugins are aggressive but imperfect:
- Can’t detect images in HTML fields (forms, custom fields)
- Miss images referenced in CSS files
- Don’t account for all the places WordPress stores image references
- Often delete images that are actually in use
Standard broken link checkers:
- Focus on page URLs, not image assets
- Don’t crawl background images in CSS
- Can’t systematically check every page on large sites
- Miss images loaded via JavaScript
Manual checking:
- Time-consuming and error-prone
- Impossible on sites with hundreds of pages
- Easy to miss images in hidden sections
- No systematic way to track progress
Our Automated Solution: Python Image Checker
This script systematically crawls your entire website using your XML sitemap and checks every image it finds – including background images in CSS.
What the Script Detects
Image sources it checks:
- `<img>` tag `src` attributes
- CSS `background-image` properties (inline styles)
- CSS background images in `<style>` blocks
- Images in any HTML content
Where it looks:
- Every page listed in your XML sitemap
- All sub-sitemaps (if you have a sitemap index)
- Both visible and CSS-referenced images
What it reports:
- ✅ Working images (with URLs)
- ❌ Broken images (with HTTP status codes)
- 📍 Which page each broken image appears on
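The CSS detection is regex-based. Here is a minimal sketch of how `url(...)` values can be pulled out of a style string, using the same pattern as the full script below (the sample style string is made up for illustration):

```python
import re

def extract_background_urls(style_content):
    """Extract URL values from CSS url(...) expressions."""
    return re.findall(r'url\(["\']?(.*?)["\']?\)', style_content)

style = "background-image: url('/img/hero.webp'); background: url(https://cdn.example.com/bg.png)"
print(extract_background_urls(style))
# ['/img/hero.webp', '/bg.png' style paths and absolute URLs are both captured]
```

Note that the regex handles quoted and unquoted `url()` forms alike; relative paths are resolved against the page URL later, at check time.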
Prerequisites and Setup
Requirements
Python version: Python 3.7 or higher (we tested with Python 3.11.13)
Required packages:
```bash
pip install requests beautifulsoup4 lxml
```
Website requirements:
- XML sitemap available (usually at `/sitemap.xml` or `/sitemap_index.xml`)
- Website accessible via HTTPS
For WordPress sites without sitemaps: Install Yoast SEO plugin – it automatically generates and maintains XML sitemaps.
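If you want to confirm the sitemap parsing works before pointing the script at a live site, you can exercise the same ElementTree namespace handling on an inline sample (the URLs here are placeholders):

```python
import xml.etree.ElementTree as ET

# A minimal sitemap in the standard sitemaps.org format
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

root = ET.fromstring(sample)
# The namespace prefix must be mapped explicitly for findall() to match
ns = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
urls = [el.text for el in root.findall('.//ns:loc', ns)]
print(urls)  # ['https://example.com/', 'https://example.com/about/']
```

If this prints an empty list against your real sitemap, the namespace declaration likely differs from the standard one above.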
Important: Test Safely
⚠️ Always test on staging first: We used Pantheon’s development instances to test the media cleanup and image scanning before touching the live site.
🔄 Clear all caching: Before and after running scans, clear:
- WordPress caching (if using caching plugins)
- CDN cache (Cloudflare, etc.)
- Server-level caching (Redis, Varnish)
📊 Run multiple scans: We ran the script several times to ensure consistent results before making final decisions.
The Complete Python Script
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re
import xml.etree.ElementTree as ET

# User agent to avoid being blocked
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
}

# MODIFY THIS: Your website's sitemap URL
root_sitemap = "https://www.YOURURLHERE.com/sitemap_index.xml"

# Track what we've already checked to avoid duplicates
visited_pages = set()
checked_images = set()

def extract_background_urls(style_content):
    """Extract URLs from CSS background-image properties"""
    return re.findall(r'url\(["\']?(.*?)["\']?\)', style_content)

def check_image(img_url, page_url):
    """Check if an image URL returns a successful response"""
    # Convert relative URLs to absolute
    full_img_url = urljoin(page_url, img_url)

    # Skip if we've already checked this image
    if full_img_url in checked_images:
        return
    checked_images.add(full_img_url)

    try:
        resp = requests.get(full_img_url, headers=headers, timeout=10)
        if resp.status_code != 200:
            print(f"❌ Broken image on {page_url}: {full_img_url} (Status {resp.status_code})")
        else:
            print(f"✅ OK image: {full_img_url}")
    except Exception as e:
        print(f"❌ Failed to fetch {full_img_url} on {page_url}: {e}")

def get_urls_from_sitemap(xml_url):
    """Extract all URLs from an XML sitemap"""
    try:
        resp = requests.get(xml_url, headers=headers, timeout=10)
        resp.raise_for_status()
        root = ET.fromstring(resp.text)
        # Handle XML namespace
        ns = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
        return [el.text for el in root.findall('.//ns:loc', ns)]
    except Exception as e:
        print(f"⚠️ Failed to parse sitemap: {xml_url} — {e}")
        return []

# Step 1: Load main sitemap (could be index or single sitemap)
print(f"🔍 Loading sitemap: {root_sitemap}")
sitemap_urls = get_urls_from_sitemap(root_sitemap)

# Step 2: Get all individual page URLs from each sub-sitemap
all_page_urls = []
for sitemap_url in sitemap_urls:
    # Check if this is another sitemap or a page URL
    if 'sitemap' in sitemap_url.lower() and sitemap_url.endswith('.xml'):
        # It's a sub-sitemap, extract URLs from it
        all_page_urls.extend(get_urls_from_sitemap(sitemap_url))
    else:
        # It's a page URL, add it directly
        all_page_urls.append(sitemap_url)

print(f"🔍 Found {len(all_page_urls)} URLs to scan for images.")

# Step 3: Crawl each page for broken images
for page_url in all_page_urls:
    if page_url in visited_pages:
        continue
    visited_pages.add(page_url)

    print(f"📄 Scanning page: {page_url}")
    try:
        response = requests.get(page_url, headers=headers, timeout=10)
        response.raise_for_status()

        # Only process HTML content
        content_type = response.headers.get("Content-Type", "")
        if "text/html" not in content_type:
            print(f"⏭️ Skipping non-HTML content: {page_url}")
            continue

        soup = BeautifulSoup(response.text, "html.parser")
    except Exception as e:
        print(f"⚠️ Failed to fetch {page_url}: {e}")
        continue

    # Check all <img> tags
    for img in soup.find_all("img"):
        src = img.get("src")
        if src:
            check_image(src, page_url)

    # Check inline style attributes for background images
    for tag in soup.find_all(style=True):
        for bg_url in extract_background_urls(tag['style']):
            check_image(bg_url, page_url)

    # Check <style> blocks for background images
    for style_tag in soup.find_all("style"):
        if style_tag.string:
            for bg_url in extract_background_urls(style_tag.string):
                check_image(bg_url, page_url)

print(f"\n✅ Scan complete! Checked {len(checked_images)} unique images across {len(visited_pages)} pages.")
```
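One detail worth understanding: `check_image` converts relative image paths to absolute URLs with `urllib.parse.urljoin`, and its result depends on whether the path is root-relative or page-relative (the URLs below are illustrative):

```python
from urllib.parse import urljoin

page = "https://example.com/blog/post/"

# Root-relative paths resolve against the domain root
print(urljoin(page, "/wp-content/uploads/a.png"))
# https://example.com/wp-content/uploads/a.png

# Page-relative paths resolve against the current page
print(urljoin(page, "images/b.jpg"))
# https://example.com/blog/post/images/b.jpg

# Absolute URLs pass through unchanged
print(urljoin(page, "https://cdn.example.com/c.webp"))
# https://cdn.example.com/c.webp
```

This is why the same broken image can show up once per page-relative variant: each resolved URL is deduplicated separately in `checked_images`.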
How to Use the Script
Step 1: Install Dependencies
```bash
# Install required Python packages
pip install requests beautifulsoup4 lxml
```
Step 2: Prepare the Script
- Copy the script into a file named `check_images.py`
- Modify the sitemap URL: change `https://www.YOURURLHERE.com/sitemap_index.xml` to your actual sitemap URL
- Save the file
Step 3: Run the Script
```bash
# Navigate to the directory containing the script
cd /path/to/your/script

# Run the image checker
python check_images.py
```
Step 4: Interpret the Results
- ✅ Working images: shown as “OK image” with the full URL
- ❌ Broken images: shown with the page where they appear, the broken URL, and the HTTP status code
Example output:
```text
📄 Scanning page: https://yoursite.com/about/
✅ OK image: https://yoursite.com/wp-content/uploads/hero-image.jpg
❌ Broken image on https://yoursite.com/about/: https://yoursite.com/old-logo.png (Status 404)
```
Understanding the Results
Common HTTP Status Codes
- 404 – Not Found: the image file was deleted or moved
- 403 – Forbidden: permission issue or hotlink protection
- 500 – Server Error: the server failed while serving the image
- Timeout errors: the server is too slow or the image is too large
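When post-processing results, a small helper can turn raw status codes into readable hints (the mapping below is our own summary of the list above, not part of the HTTP spec):

```python
def status_hint(status):
    """Map an HTTP status code to a likely cause (rough heuristic)."""
    hints = {
        404: "Not Found: image file was deleted or moved",
        403: "Forbidden: permission issue or hotlink protection",
        500: "Server Error: server problem loading the image",
    }
    return hints.get(status, f"Unexpected status {status}: investigate manually")

print(status_hint(404))
# Not Found: image file was deleted or moved
```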
Prioritizing Fixes
Critical broken images:
- Images on homepage or key landing pages
- Product images on e-commerce sites
- Images in active marketing campaigns
- Logo or branding images
Medium priority:
- Images in blog posts or content pages
- Background images that don’t affect functionality
- Images in archived content
Low priority:
- Images in very old blog posts
- Decorative images that don’t impact user experience
- Images in draft or private content
Real-World Case Study: Media Cleanup Success
The situation: Client had 15,000 images in WordPress media library, many unused after years of uploads.
The process:
- Staging setup: Used Pantheon dev environment for testing
- Initial cleanup: Ran media cleaner plugin (deleted ~8,000 images)
- First scan: Found 47 broken images the plugin missed
- Restoration: Restored 12 critical images from backups
- Final scan: Verified all essential images working
- Live deployment: Repeated process on production with confidence
The results:
- 53% reduction in media library size
- Faster site performance
- Zero broken images on live site
- Clear documentation of what was removed
Important Warnings and Limitations
⚠️ Performance Impact
This script will make many requests to your website:
- One request per page in your sitemap
- One request per unique image found
- Can generate hundreds or thousands of requests
Use responsibly:
- Run during low-traffic periods
- Consider adding delays between requests for large sites
- Monitor your server resources while running
🚨 Liability Disclaimer
We provide this script as-is:
- No warranty or guarantee of functionality
- Not liable for any damages or misuse
- Test thoroughly before relying on results
- Your mileage may vary based on site configuration
🔍 Script Limitations
What it doesn’t check:
- Images loaded via JavaScript after page load
- Images behind authentication/login walls
- Images with dynamic URLs (some CDN configurations)
- Images in iframes or embedded content
Potential false positives:
- Images that require specific headers or cookies
- Images behind geographic restrictions
- Lazy-loaded images with complex loading logic
Advanced Customization Options
Add Request Delays (For Large Sites)
```python
import time

# Add this inside the page loop, after each page request
time.sleep(1)  # Wait 1 second between pages
```
Filter Specific Image Types
```python
# Only check certain file extensions
def should_check_image(url):
    allowed_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.webp', '.avif']
    return any(url.lower().endswith(ext) for ext in allowed_extensions)

# In check_image, skip non-image URLs before making the request:
#     if not should_check_image(full_img_url):
#         return
```
Export Results to CSV
```python
import csv

broken_images = []

# Call this instead of printing directly when a broken image is found
def log_broken_image(page_url, img_url, status):
    broken_images.append({
        'page': page_url,
        'image': img_url,
        'status': status
    })
    print(f"❌ Broken image on {page_url}: {img_url} (Status {status})")

# At the end of the script
with open('broken_images.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['page', 'image', 'status'])
    writer.writeheader()
    writer.writerows(broken_images)
```
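Once results are collected in `broken_images`, grouping them by page makes triage easier. A post-processing sketch using only the standard library (the sample rows are made up for illustration):

```python
from collections import defaultdict

# Example rows in the same shape the CSV export collects
broken_images = [
    {'page': '/about/', 'image': '/old-logo.png', 'status': 404},
    {'page': '/about/', 'image': '/team.jpg', 'status': 404},
    {'page': '/pricing/', 'image': '/chart.png', 'status': 403},
]

# Group broken image URLs under the page they appear on
by_page = defaultdict(list)
for row in broken_images:
    by_page[row['page']].append(row['image'])

for page, images in sorted(by_page.items()):
    print(f"{page}: {len(images)} broken image(s)")
```

Sorting by count instead of page name would surface the worst-affected pages first.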
Best Practices for Media Management
Before Media Cleanup
- Full site backup (files and database)
- Document current media usage with this script
- Test on staging environment thoroughly
- Clear all caches before and after
During Cleanup
- Use conservative settings on media cleaner plugins
- Run multiple scans to verify consistency
- Keep deleted files in trash temporarily
- Document what you’re removing
After Cleanup
- Run image checker script immediately
- Test critical user journeys (checkout, contact forms, etc.)
- Check mobile responsiveness (images often break differently on mobile)
- Monitor for 24-48 hours before considering cleanup complete
When to Get Professional Help
DIY if:
- You’re comfortable with command line tools
- Your site has < 1,000 pages
- You have good backups and staging environment
- You can afford some trial and error
Get expert help if:
- Large enterprise website with complex media requirements
- E-commerce site where broken images mean lost sales
- No staging environment or reliable backups
- Need integration with existing development workflows
- Require custom reporting or monitoring solutions
Professional media management services include:
- Comprehensive media audits and cleanup strategies
- Custom scripts for specific CMS or hosting environments
- Integration with development and deployment workflows
- Ongoing monitoring and maintenance
- Training for internal teams
Dealing with complex WordPress media management or need custom website optimization scripts? Contact Knihter for professional WordPress development and optimization services. We specialize in technical solutions for media management, performance optimization, and systematic website maintenance.
Related Services:
- WordPress media optimization and management
- Custom Python scripts for website maintenance
- Website performance auditing and optimization
- Pantheon hosting optimization and development workflows