E-commerce price scraping is one of the most common and valuable applications of web proxies. Retailers monitor competitor prices, brands enforce MAP (Minimum Advertised Price) policies, and researchers track inflation. This guide covers building reliable, scalable price scrapers that handle anti-bot systems and deliver clean data.
The Challenges
- Anti-bot protection: Amazon, Shopify stores, Walmart, and others use sophisticated detection (Cloudflare, Akamai, PerimeterX)
- Dynamic pricing: Prices change based on location, time of day, user history, and demand
- JavaScript rendering: Many sites load prices via JavaScript after page load
- Rate limiting: Too many requests from one IP trigger blocks
- Structural changes: Sites update their HTML structure frequently, breaking selectors
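One way to soften the structural-changes problem, on pages that happen to publish schema.org Product data as JSON-LD, is to read the price from that block instead of from CSS selectors. The sketch below is an illustration under that assumption: the function name `price_from_json_ld` and the sample HTML are hypothetical, and it only handles a top-level Product object or list (not `@graph` wrappers or list-valued `@type`).

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script> blocks.
JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def price_from_json_ld(html):
    """Return (price, currency) from the first schema.org Product block, or None."""
    for raw in JSON_LD_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict) or item.get("@type") != "Product":
                continue
            offers = item.get("offers")
            if isinstance(offers, list):
                offers = offers[0] if offers else None
            if isinstance(offers, dict) and "price" in offers:
                return str(offers["price"]), offers.get("priceCurrency")
    return None


# Hypothetical sample page used only for illustration.
sample = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Demo", "offers": {"@type": "Offer", "price": "29.99", "priceCurrency": "USD"}}
</script>
</head></html>
"""
print(price_from_json_ld(sample))  # ('29.99', 'USD')
```

Structured data tends to change less often than page layout, so a lookup like this can serve as a first attempt, with site-specific selectors as the fallback.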
Building a Basic Price Scraper
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timezone

def scrape_price(url, proxy_country="us"):
    """Scrape product price through rotating proxy."""
    proxy = f"http://USER:PASS_country-{proxy_country}@gate.zentislabs.com:7777"
    proxies = {"http": proxy, "https": proxy}
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
        "Accept-Language": "en-US,en;q=0.9",
        # "br" is omitted: requests only decodes Brotli when a brotli package is installed
        "Accept-Encoding": "gzip, deflate",
    }
    r = requests.get(url, proxies=proxies, headers=headers, timeout=15)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    # Generic price extraction (adapt selectors per site)
    price_el = soup.select_one('[data-price], .price, .product-price, [itemprop="price"]')
    title_el = soup.select_one('h1, [itemprop="name"], .product-title')
    # In schema.org microdata the currency code sits on a separate priceCurrency element
    currency_el = soup.select_one('[itemprop="priceCurrency"]')
    return {
        "url": url,
        "title": title_el.get_text(strip=True) if title_el else None,
        "price": price_el.get_text(strip=True) if price_el else None,
        "currency": currency_el.get("content") if currency_el else None,
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }

Handling JavaScript-Rendered Prices
When prices are loaded dynamically, use Playwright instead of requests:
from playwright.sync_api import sync_playwright

def scrape_dynamic_price(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            proxy={"server": "http://gate.zentislabs.com:7777", "username": "USER", "password": "PASS"}
        )
        try:
            page = browser.new_page()
            # Block images to save bandwidth
            page.route("**/*.{png,jpg,jpeg,gif,svg,webp}", lambda route: route.abort())
            page.goto(url, wait_until="networkidle")
            # Wait for price element to appear
            page.wait_for_selector('[data-price], .price', timeout=10000)
            price = page.text_content('[data-price], .price')
            title = page.text_content('h1')
        finally:
            # Close the browser even if navigation or the selector wait fails
            browser.close()
        # text_content() may return None, so guard before calling strip()
        return {
            "title": title.strip() if title else None,
            "price": price.strip() if price else None,
        }

Structuring the Data
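The scrapers above return loosely typed dictionaries. One possible way to give each observation a fixed shape is a small dataclass. The `PriceRecord` name and its fields are illustrative assumptions, not something the scraping functions above already produce.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class PriceRecord:
    """One price observation for one product URL (hypothetical schema)."""
    url: str
    title: Optional[str]
    amount: Optional[Decimal]   # numeric price, None if it could not be parsed
    currency: Optional[str]     # currency code such as "USD", if known
    country: str                # proxy country the page was fetched from
    scraped_at: str             # ISO 8601 timestamp in UTC


record = PriceRecord(
    url="https://store.com/product/1",
    title="Demo product",
    amount=Decimal("29.99"),
    currency="USD",
    country="us",
    scraped_at=datetime.now(timezone.utc).isoformat(),
)
print(asdict(record)["amount"])  # 29.99
```

Raw price strings such as "$29.99" still have to be converted to numbers before they fit a field like `amount`, which is what the parsing function that follows handles.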
import re
from decimal import Decimal

def parse_price(price_str):
    """Extract numeric price from text like '$29.99' or '€ 1.234,56'."""
    if not price_str:
        return None
    # Remove currency symbols and whitespace
    cleaned = re.sub(r'[^\d.,]', '', price_str.strip())
    # No digits at all, e.g. "Out of stock"
    if not re.search(r'\d', cleaned):
        return None
    # Handle European format (1.234,56 -> 1234.56)
    if ',' in cleaned and '.' in cleaned:
        if cleaned.rindex(',') > cleaned.rindex('.'):
            cleaned = cleaned.replace('.', '').replace(',', '.')
        else:
            cleaned = cleaned.replace(',', '')
    elif ',' in cleaned:
        # Could be 1,234 or 12,50
        parts = cleaned.split(',')
        if len(parts[-1]) == 2:
            cleaned = cleaned.replace(',', '.')
        else:
            cleaned = cleaned.replace(',', '')
    return float(Decimal(cleaned))

# Usage
print(parse_price("$29.99"))      # 29.99
print(parse_price("€ 1.234,56"))  # 1234.56
print(parse_price("¥12,800"))     # 12800.0

Scaling to Thousands of Products
import asyncio
import aiohttp
from aiohttp_socks import ProxyConnector

async def scrape_batch(urls, max_concurrent=20):
    """Scrape multiple URLs concurrently with rotating proxies."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch(url):
        async with semaphore:
            proxy = "http://USER:PASS@gate.zentislabs.com:7777"
            # A fresh connector per request opens a new proxy connection each time
            connector = ProxyConnector.from_url(proxy)
            async with aiohttp.ClientSession(connector=connector) as session:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as r:
                    html = await r.text()
                    return {"url": url, "html": html, "status": r.status}

    tasks = [fetch(url) for url in urls]
    # return_exceptions=True returns failures as values instead of raising the first one
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # Failed requests are dropped here; log them instead if you need to retry
    return [r for r in results if not isinstance(r, Exception)]

# Scrape 500 product pages
urls = [f"https://store.com/product/{i}" for i in range(1, 501)]
results = asyncio.run(scrape_batch(urls))
print(f"Successfully scraped {len(results)} products")

Best Practices
- Rotate User-Agents: Maintain a list of 50+ real browser User-Agent strings and rotate them.
- Respect robots.txt: Check the site's robots.txt and terms of service. Scrape responsibly.
- Add delays: Random 1-5 second delays between requests reduce detection risk.
- Use residential proxies: E-commerce sites block datacenter IPs aggressively.
- Monitor success rates: Track your response codes. If 403s spike, slow down or switch proxy regions.
- Handle price formats: Different countries use different decimal and thousands separators.
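A few of these practices can be sketched with the standard library alone. The example below is a minimal illustration, not a complete client: the names `random_headers`, `random_delay`, and `StatusMonitor` are hypothetical, the three User-Agent strings stand in for a much larger pool, and treating 403 and 429 as "blocked" is an assumption you may want to adjust per site.

```python
import random
from collections import Counter

# Placeholder pool; in practice this would hold many real, current browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.3 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:123.0) Gecko/20100101 Firefox/123.0",
]


def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }


def random_delay(min_s=1.0, max_s=5.0):
    """Return a random pause length in seconds; pass it to time.sleep()."""
    return random.uniform(min_s, max_s)


class StatusMonitor:
    """Count response codes and report the share of 403/429 responses."""

    def __init__(self):
        self.counts = Counter()

    def record(self, status):
        self.counts[status] += 1

    def block_rate(self):
        total = sum(self.counts.values())
        if total == 0:
            return 0.0
        return (self.counts[403] + self.counts[429]) / total


monitor = StatusMonitor()
for status in (200, 200, 403, 200):  # simulated response codes
    monitor.record(status)
print(f"Block rate: {monitor.block_rate():.0%}")  # Block rate: 25%
```

In a real loop you would call `time.sleep(random_delay())` between requests, pass `random_headers()` to each request, and slow down or switch proxy regions once `block_rate()` crosses a threshold you choose.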
💰 ZentisLabs residential proxies offer 195+ country geo-targeting with non-expiring bandwidth — ideal for monitoring prices across global markets without worrying about monthly data caps.
