The Problem: Media Cleanup Gone Wrong
You’ve just finished cleaning up your WordPress media library – removing thousands of unused images, converting formats (JPG to AVIF, PNG to WebP), and optimizing file sizes. Everything looks great until you start finding broken images scattered across your site.
Sound familiar? This exact scenario happened with a recent client. After an aggressive media cleanup using WordPress plugins, we discovered broken images hiding in places the cleanup tools couldn’t detect:
- Images embedded in Gravity Forms HTML fields
- Background images in custom CSS
- Images in old blog posts and custom fields
- Generated image sizes that got deleted
The challenge: Most broken link checkers focus on URLs, not images. And manually checking hundreds of pages for broken images is impractical.
The solution: A Python script that automatically scans your entire website for broken images using your XML sitemap.
Why Existing Tools Fall Short
WordPress media cleaner plugins are aggressive but imperfect:
- Can’t detect images in HTML fields (forms, custom fields)
- Miss images referenced in CSS files
- Don’t account for all the places WordPress stores image references
- Often delete images that are actually in use
Standard broken link checkers:
- Focus on page URLs, not image assets
- Don’t crawl background images in CSS
- Can’t systematically check every page on large sites
- Miss images loaded via JavaScript
Manual checking:
- Time-consuming and error-prone
- Impossible on sites with hundreds of pages
- Easy to miss images in hidden sections
- No systematic way to track progress
Our Automated Solution: Python Image Checker
This script systematically crawls your entire website using your XML sitemap and checks every image it finds – including background images in CSS.
What the Script Detects
Image sources it checks:
- `<img>` tag `src` attributes
- CSS `background-image` properties (inline styles)
- CSS background images in `<style>` blocks
- Images in any HTML content
Where it looks:
- Every page listed in your XML sitemap
- All sub-sitemaps (if you have a sitemap index)
- Both visible and CSS-referenced images
What it reports:
- ✅ Working images (with URLs)
- ❌ Broken images (with HTTP status codes)
- 📍 Which page each broken image appears on
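The CSS detection is regex-based. Here is a minimal sketch of how `url(...)` values can be pulled out of a style string, using the same pattern as the full script below (the sample style string is made up for illustration):

```python
import re

def extract_background_urls(style_content):
    """Extract URL values from CSS url(...) expressions."""
    return re.findall(r'url\(["\']?(.*?)["\']?\)', style_content)

style = "background-image: url('/img/hero.webp'); background: url(https://cdn.example.com/bg.png)"
print(extract_background_urls(style))
# ['/img/hero.webp', '/bg.png' style paths and absolute URLs are both captured]
```

Note that the regex handles quoted and unquoted `url()` forms alike; relative paths are resolved against the page URL later, at check time.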
Prerequisites and Setup
Requirements
Python version: Python 3.7 or higher (we tested with Python 3.11.13)
Required packages:
```bash
pip install requests beautifulsoup4 lxml
```
Website requirements:
- XML sitemap available (usually at `/sitemap.xml` or `/sitemap_index.xml`)
- Website accessible via HTTPS
For WordPress sites without sitemaps: Install Yoast SEO plugin – it automatically generates and maintains XML sitemaps.
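If you want to confirm the sitemap parsing works before pointing the script at a live site, you can exercise the same ElementTree namespace handling on an inline sample (the URLs here are placeholders):

```python
import xml.etree.ElementTree as ET

# A minimal sitemap in the standard sitemaps.org format
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

root = ET.fromstring(sample)
# The namespace prefix must be mapped explicitly for findall() to match
ns = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
urls = [el.text for el in root.findall('.//ns:loc', ns)]
print(urls)  # ['https://example.com/', 'https://example.com/about/']
```

If this prints an empty list against your real sitemap, the namespace declaration likely differs from the standard one above.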
Important: Test Safely
⚠️ Always test on staging first: We used Pantheon’s development instances to test the media cleanup and image scanning before touching the live site.
🔄 Clear all caching: Before and after running scans, clear:
- WordPress caching (if using caching plugins)
- CDN cache (Cloudflare, etc.)
- Server-level caching (Redis, Varnish)
📊 Run multiple scans: We ran the script several times to ensure consistent results before making final decisions.
The Complete Python Script
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re
import xml.etree.ElementTree as ET

# User agent to avoid being blocked
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
}

# MODIFY THIS: Your website's sitemap URL
root_sitemap = "https://www.YOURURLHERE.com/sitemap_index.xml"

# Track what we've already checked to avoid duplicates
visited_pages = set()
checked_images = set()

def extract_background_urls(style_content):
    """Extract URLs from CSS background-image properties"""
    return re.findall(r'url\(["\']?(.*?)["\']?\)', style_content)

def check_image(img_url, page_url):
    """Check if an image URL returns a successful response"""
    # Convert relative URLs to absolute
    full_img_url = urljoin(page_url, img_url)

    # Skip if we've already checked this image
    if full_img_url in checked_images:
        return
    checked_images.add(full_img_url)

    try:
        resp = requests.get(full_img_url, headers=headers, timeout=10)
        if resp.status_code != 200:
            print(f"❌ Broken image on {page_url}: {full_img_url} (Status {resp.status_code})")
        else:
            print(f"✅ OK image: {full_img_url}")
    except Exception as e:
        print(f"❌ Failed to fetch {full_img_url} on {page_url}: {e}")

def get_urls_from_sitemap(xml_url):
    """Extract all URLs from an XML sitemap"""
    try:
        resp = requests.get(xml_url, headers=headers, timeout=10)
        resp.raise_for_status()
        root = ET.fromstring(resp.text)
        # Handle XML namespace
        ns = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
        return [el.text for el in root.findall('.//ns:loc', ns)]
    except Exception as e:
        print(f"⚠️ Failed to parse sitemap: {xml_url} — {e}")
        return []

# Step 1: Load main sitemap (could be index or single sitemap)
print(f"🔍 Loading sitemap: {root_sitemap}")
sitemap_urls = get_urls_from_sitemap(root_sitemap)

# Step 2: Get all individual page URLs from each sub-sitemap
all_page_urls = []
for sitemap_url in sitemap_urls:
    # Check if this is another sitemap or a page URL
    if 'sitemap' in sitemap_url.lower() and sitemap_url.endswith('.xml'):
        # It's a sub-sitemap, extract URLs from it
        all_page_urls.extend(get_urls_from_sitemap(sitemap_url))
    else:
        # It's a page URL, add it directly
        all_page_urls.append(sitemap_url)

print(f"🔍 Found {len(all_page_urls)} URLs to scan for images.")

# Step 3: Crawl each page for broken images
for page_url in all_page_urls:
    if page_url in visited_pages:
        continue
    visited_pages.add(page_url)

    print(f"📄 Scanning page: {page_url}")
    try:
        response = requests.get(page_url, headers=headers, timeout=10)
        response.raise_for_status()

        # Only process HTML content
        content_type = response.headers.get("Content-Type", "")
        if "text/html" not in content_type:
            print(f"⏭️ Skipping non-HTML content: {page_url}")
            continue

        soup = BeautifulSoup(response.text, "html.parser")
    except Exception as e:
        print(f"⚠️ Failed to fetch {page_url}: {e}")
        continue

    # Check all <img> tags
    for img in soup.find_all("img"):
        src = img.get("src")
        if src:
            check_image(src, page_url)

    # Check inline style attributes for background images
    for tag in soup.find_all(style=True):
        for bg_url in extract_background_urls(tag['style']):
            check_image(bg_url, page_url)

    # Check <style> blocks for background images
    for style_tag in soup.find_all("style"):
        if style_tag.string:
            for bg_url in extract_background_urls(style_tag.string):
                check_image(bg_url, page_url)

print(f"\n✅ Scan complete! Checked {len(checked_images)} unique images across {len(visited_pages)} pages.")
```
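One detail worth understanding: `check_image` converts relative image paths to absolute URLs with `urllib.parse.urljoin`, and its result depends on whether the path is root-relative or page-relative (the URLs below are illustrative):

```python
from urllib.parse import urljoin

page = "https://example.com/blog/post/"

# Root-relative paths resolve against the domain root
print(urljoin(page, "/wp-content/uploads/a.png"))
# https://example.com/wp-content/uploads/a.png

# Page-relative paths resolve against the current page
print(urljoin(page, "images/b.jpg"))
# https://example.com/blog/post/images/b.jpg

# Absolute URLs pass through unchanged
print(urljoin(page, "https://cdn.example.com/c.webp"))
# https://cdn.example.com/c.webp
```

This is why the same broken image can show up once per page-relative variant: each resolved URL is deduplicated separately in `checked_images`.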
How to Use the Script
Step 1: Install Dependencies
```bash
# Install required Python packages
pip install requests beautifulsoup4 lxml
```
Step 2: Prepare the Script
- Copy the script into a file named `check_images.py`
- Modify the sitemap URL: change `https://www.YOURURLHERE.com/sitemap_index.xml` to your actual sitemap URL
- Save the file
Step 3: Run the Script
```bash
# Navigate to the directory containing the script
cd /path/to/your/script

# Run the image checker
python check_images.py
```
Step 4: Interpret the Results
- ✅ Working images: shown as “OK image” with the full URL
- ❌ Broken images: shown with the page where they appear, the broken URL, and the HTTP status code
Example output:
```text
📄 Scanning page: https://yoursite.com/about/
✅ OK image: https://yoursite.com/wp-content/uploads/hero-image.jpg
❌ Broken image on https://yoursite.com/about/: https://yoursite.com/old-logo.png (Status 404)
```
Understanding the Results
Common HTTP Status Codes
- 404 – Not Found: the image file was deleted or moved
- 403 – Forbidden: permission issue or hotlink protection
- 500 – Server Error: the server failed while serving the image
- Timeout errors: the server is too slow or the image is too large
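When post-processing results, a small helper can turn raw status codes into readable hints (the mapping below is our own summary of the list above, not part of the HTTP spec):

```python
def status_hint(status):
    """Map an HTTP status code to a likely cause (rough heuristic)."""
    hints = {
        404: "Not Found: image file was deleted or moved",
        403: "Forbidden: permission issue or hotlink protection",
        500: "Server Error: server problem loading the image",
    }
    return hints.get(status, f"Unexpected status {status}: investigate manually")

print(status_hint(404))
# Not Found: image file was deleted or moved
```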
Prioritizing Fixes
Critical broken images:
- Images on homepage or key landing pages
- Product images on e-commerce sites
- Images in active marketing campaigns
- Logo or branding images
Medium priority:
- Images in blog posts or content pages
- Background images that don’t affect functionality
- Images in archived content
Low priority:
- Images in very old blog posts
- Decorative images that don’t impact user experience
- Images in draft or private content
Real-World Case Study: Media Cleanup Success
The situation: Client had 15,000 images in WordPress media library, many unused after years of uploads.
The process:
- Staging setup: Used Pantheon dev environment for testing
- Initial cleanup: Ran media cleaner plugin (deleted ~8,000 images)
- First scan: Found 47 broken images the plugin missed
- Restoration: Restored 12 critical images from backups
- Final scan: Verified all essential images working
- Live deployment: Repeated process on production with confidence
The results:
- 53% reduction in media library size
- Faster site performance
- Zero broken images on live site
- Clear documentation of what was removed
Important Warnings and Limitations
⚠️ Performance Impact
This script will make many requests to your website:
- One request per page in your sitemap
- One request per unique image found
- Can generate hundreds or thousands of requests
Use responsibly:
- Run during low-traffic periods
- Consider adding delays between requests for large sites
- Monitor your server resources while running
🚨 Liability Disclaimer
We provide this script as-is:
- No warranty or guarantee of functionality
- Not liable for any damages or misuse
- Test thoroughly before relying on results
- Your mileage may vary based on site configuration
🔍 Script Limitations
What it doesn’t check:
- Images loaded via JavaScript after page load
- Images behind authentication/login walls
- Images with dynamic URLs (some CDN configurations)
- Images in iframes or embedded content
Potential false positives:
- Images that require specific headers or cookies
- Images behind geographic restrictions
- Lazy-loaded images with complex loading logic
Advanced Customization Options
Add Request Delays (For Large Sites)
```python
import time

# Add this inside the page loop, after each page request
time.sleep(1)  # Wait 1 second between pages
```
Filter Specific Image Types
```python
# Only check certain file extensions
def should_check_image(url):
    allowed_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.webp', '.avif']
    return any(url.lower().endswith(ext) for ext in allowed_extensions)

# In check_image, skip non-image URLs before making the request:
#     if not should_check_image(full_img_url):
#         return
```
Export Results to CSV
```python
import csv

broken_images = []

# Call this instead of printing directly when a broken image is found
def log_broken_image(page_url, img_url, status):
    broken_images.append({
        'page': page_url,
        'image': img_url,
        'status': status
    })
    print(f"❌ Broken image on {page_url}: {img_url} (Status {status})")

# At the end of the script
with open('broken_images.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['page', 'image', 'status'])
    writer.writeheader()
    writer.writerows(broken_images)
```
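Once results are collected in `broken_images`, grouping them by page makes triage easier. A post-processing sketch using only the standard library (the sample rows are made up for illustration):

```python
from collections import defaultdict

# Example rows in the same shape the CSV export collects
broken_images = [
    {'page': '/about/', 'image': '/old-logo.png', 'status': 404},
    {'page': '/about/', 'image': '/team.jpg', 'status': 404},
    {'page': '/pricing/', 'image': '/chart.png', 'status': 403},
]

# Group broken image URLs under the page they appear on
by_page = defaultdict(list)
for row in broken_images:
    by_page[row['page']].append(row['image'])

for page, images in sorted(by_page.items()):
    print(f"{page}: {len(images)} broken image(s)")
```

Sorting by count instead of page name would surface the worst-affected pages first.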
Best Practices for Media Management
Before Media Cleanup
- Full site backup (files and database)
- Document current media usage with this script
- Test on staging environment thoroughly
- Clear all caches before and after
During Cleanup
- Use conservative settings on media cleaner plugins
- Run multiple scans to verify consistency
- Keep deleted files in trash temporarily
- Document what you’re removing
After Cleanup
- Run image checker script immediately
- Test critical user journeys (checkout, contact forms, etc.)
- Check mobile responsiveness (images often break differently on mobile)
- Monitor for 24-48 hours before considering cleanup complete
When to Get Professional Help
DIY if:
- You’re comfortable with command line tools
- Your site has < 1,000 pages
- You have good backups and staging environment
- You can afford some trial and error
Get expert help if:
- Large enterprise website with complex media requirements
- E-commerce site where broken images mean lost sales
- No staging environment or reliable backups
- Need integration with existing development workflows
- Require custom reporting or monitoring solutions
Professional media management services include:
- Comprehensive media audits and cleanup strategies
- Custom scripts for specific CMS or hosting environments
- Integration with development and deployment workflows
- Ongoing monitoring and maintenance
- Training for internal teams
Dealing with complex WordPress media management or need custom website optimization scripts? Contact Knihter for professional WordPress development and optimization services. We specialize in technical solutions for media management, performance optimization, and systematic website maintenance.
Related Services:
- WordPress media optimization and management
- Custom Python scripts for website maintenance
- Website performance auditing and optimization
- Pantheon hosting optimization and development workflows