Web Scraping with Python in 2026: Tools, Tutorial & Real Android Chrome for Hard Sites
Web scraping with Python means programmatically extracting structured data from websites. The available libraries range from requests and BeautifulSoup for static pages to Damru, which runs genuine Android Chrome for bot-protected, mobile-rendered targets.
Python is the dominant language for web data scraping — and for good reason. Readable syntax, a deep ecosystem of parsing and browser-automation libraries, and strong async support with asyncio make it the natural choice for everything from a quick one-off data pull to a production-grade scraping pipeline. The key is choosing the right tool for the target site’s complexity, then knowing precisely when to upgrade.
This guide moves through the Python scraping stack in order of complexity, then walks through a complete code example using Damru, the source-available (PolyForm Noncommercial) Android-native framework for sites that defeat conventional scrapers.
Why Python and Web Scraping Go Hand-in-Hand
Python’s web scraping ecosystem is uniquely mature. requests makes HTTP a three-liner. lxml and BeautifulSoup parse messy HTML without fuss. Playwright and Selenium drive real browsers for single-page applications. asyncio enables concurrent scraping across dozens of pages simultaneously. And at the advanced end, Damru connects Playwright to a real Android Chrome instance for fingerprint-sensitive targets, bridging the gap between browser automation and genuine mobile browsing.
No other language offers this full spectrum in a single, coherent ecosystem.
Python Web Data Scraping Tool Comparison
| Tool | Best For | JS Rendering | Bot-Protection Resistance | Mobile Fingerprint | Open Source |
|---|---|---|---|---|---|
| requests + BeautifulSoup | Static HTML pages, RSS, APIs | No | Low | No | Yes |
| httpx + parsel | Async static scraping | No | Low | No | Yes |
| Playwright (desktop Chromium) | JS-rendered SPAs, dynamic pages | Yes | Medium | No (desktop UA) | Yes |
| Selenium + undetected-chromedriver | JS pages, basic evasion | Yes | Medium | No | Yes |
| SeleniumBase (UC mode) | QA + medium-difficulty evasion | Yes | Medium-High | No | Yes |
| Damru | Bot-protected, mobile-rendered sites | Yes (real Android Chrome) | High (authentic Android fingerprint) | Yes (real) | Source-available (PolyForm Noncommercial 1.0.0) |
Step 1 — Static Pages: requests + BeautifulSoup
For pages that deliver all content in the initial HTML response, requests plus BeautifulSoup is the fastest, most portable approach:
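import requests
from bs4 import BeautifulSoup

# Fetch the page; a timeout keeps a stalled server from hanging the script.
response = requests.get(
    "https://example.com/products",
    headers={"User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0)"},
    timeout=10,
)
response.raise_for_status()  # stop early on 4xx/5xx responses

# Parse with the lxml backend (requires the lxml package) and collect titles.
soup = BeautifulSoup(response.text, "lxml")
titles = [tag.get_text(strip=True) for tag in soup.select("h2.product-title")]
print(titles)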
This approach fails on sites that render content client-side via JavaScript frameworks (React, Vue, Angular), return captcha challenges to non-browser User-Agents, or employ TLS fingerprinting to reject non-browser HTTP clients.
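One way to notice the first of these failure modes is to check whether the initial HTML actually contains the elements you expect. The sketch below uses only the standard library so it runs without extra packages. The names `ClassCounter`, `count_class`, and `needs_browser`, the sample HTML strings, and the `product-title` class are all hypothetical, and a real check would depend on the target site's markup.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute includes a given class name."""

    def __init__(self, class_name: str) -> None:
        super().__init__()
        self.class_name = class_name
        self.matches = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.matches += 1


def count_class(html: str, class_name: str) -> int:
    """Return how many elements in the raw HTML carry class_name."""
    counter = ClassCounter(class_name)
    counter.feed(html)
    return counter.matches


def needs_browser(html: str, class_name: str = "product-title") -> bool:
    """True when the expected elements are absent from the initial HTML."""
    return count_class(html, class_name) == 0


# Hypothetical responses: one server-rendered, one an empty SPA shell.
STATIC_HTML = '<h2 class="product-title">Kettle</h2><h2 class="product-title">Mug</h2>'
SPA_SHELL = '<div id="root"></div><script src="/app.js"></script>'

print(needs_browser(STATIC_HTML))
print(needs_browser(SPA_SHELL))
```

If the check reports that the elements are missing, that is a hint (not proof) that the content is rendered client-side or that the response was a challenge page, and that a browser-based step is worth trying.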
Step 2 — JavaScript-Rendered Pages: Playwright
When a page loads its data after the initial HTML via JavaScript, you need a real browser engine:
import asyncio
from playwright.async_api import async_playwright

async def scrape_js_page():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com/products")
        await page.wait_for_selector("h2.product-title")
        titles = await page.locator("h2.product-title").all_inner_texts()
        await browser.close()
        return titles

print(asyncio.run(scrape_js_page()))
Desktop Playwright resolves most JavaScript rendering challenges. However, sites protected by Cloudflare Bot Management, DataDome, PerimeterX, Kasada, or Akamai Bot Manager detect the HeadlessChrome browser signature, the desktop TLS fingerprint, and the absence of expected mobile sensor APIs, triggering blocks or captcha challenges.
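To illustrate why a default headless signature is easy to spot, here is the simplest kind of server-side check. It is an illustrative sketch only and does not describe how any of the vendors named above work: `looks_automated`, the marker list, and the sample User-Agent strings are hypothetical, and production detectors combine many more signals (TLS, JavaScript probes, behavior).

```python
def looks_automated(headers: dict[str, str]) -> bool:
    """Flag requests whose User-Agent is missing or advertises a known automation client."""
    ua = headers.get("User-Agent", "").lower()
    markers = ("headlesschrome", "python-requests", "curl/")
    return not ua or any(marker in ua for marker in markers)


# Hypothetical sample headers for comparison.
HEADLESS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36"
}
ANDROID = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36"
}

print(looks_automated(HEADLESS))
print(looks_automated(ANDROID))
```

Overriding the User-Agent string defeats a check this naive, which is why detectors also compare it against lower-level signals such as the TLS handshake.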
Step 3 — Bot-Protected Sites: Damru with Real Android Chrome
For bot-protected, mobile-rendered pages, Damru runs genuine Android Chrome via Redroid — giving every request an authentic mobile fingerprint that advanced detectors cannot distinguish from a real device.
Install
pip install damru
Complete Code Walkthrough
import asyncio
from damru import AsyncDamru

async def scrape_protected_site(url: str) -> list[dict]:
    async with AsyncDamru() as browser:
        # 1. AsyncDamru starts Redroid (Android-in-Docker) and
        #    launches real Chrome for Android inside the container.
        #    CDP is wired up automatically; no manual setup needed.
        page = await browser.new_page()

        # 2. page.goto() issues the request from genuine Android Chrome.
        #    TLS handshake, User-Agent, Accept-Language, sec-ch-ua headers
        #    all originate from a real Android build, not a spoofed desktop.
        await page.goto(url, wait_until="networkidle")

        # 3. Standard Playwright selectors work identically.
        #    Damru is transparent at the API layer: same syntax,
        #    different browser underneath.
        await page.wait_for_selector(".product-card", timeout=15_000)
        cards = page.locator(".product-card")
        count = await cards.count()
        results = []
        for i in range(count):
            card = cards.nth(i)
            name = await card.locator(".name").inner_text()
            price = await card.locator(".price").inner_text()
            results.append({"name": name.strip(), "price": price.strip()})

    # 4. Exiting the context manager stops the Redroid container cleanly.
    return results

data = asyncio.run(
    scrape_protected_site("https://bot-protected-example.com/products")
)
for row in data:
    print(row)
Line-by-Line Explanation
| Code | What happens |
|---|---|
| `async with AsyncDamru()` | Starts the Redroid Docker container; launches Android Chrome; establishes CDP. |
| `browser.new_page()` | Opens a tab in real Android Chrome, not a desktop browser with patched APIs. |
| `page.goto(url)` | Request sent with authentic Android Chrome TLS fingerprint, real sec-ch-ua, genuine mobile UA string. |
| `page.wait_for_selector(...)` | Standard Playwright selector wait, identical to desktop Playwright usage. |
| `card.locator(...).inner_text()` | Data extraction using Playwright locators, with no change in syntax from standard usage. |
| Context manager `__aexit__` | Container and Chrome instance stop cleanly; no orphaned Docker processes. |
Full API reference, Docker prerequisites (Linux host, KVM support, Docker), and multi-page concurrency examples: github.com/akwin1234/damru.
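For multi-page work, a common asyncio pattern is to cap how many pages are in flight at once. The sketch below shows that pattern with a stand-in coroutine and makes no Damru calls: `fetch_title`, `scrape_many`, and `max_concurrent` are hypothetical names, and how many concurrent tabs a single AsyncDamru session tolerates is not something this guide specifies.

```python
import asyncio


async def fetch_title(url: str) -> str:
    """Stand-in for real page work (e.g. goto plus locator calls); hypothetical."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"title for {url}"


async def scrape_many(urls: list[str], max_concurrent: int = 3) -> list[str]:
    """Run fetch_title over urls with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        # The semaphore blocks here once max_concurrent tasks hold it.
        async with semaphore:
            return await fetch_title(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/catalog?page={n}" for n in range(1, 6)]
titles = asyncio.run(scrape_many(urls))
print(titles)
```

To adapt it, replace the body of `fetch_title` with the per-page browser work and tune `max_concurrent` to what the host and the target site can reasonably handle.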
Handling Pagination, Rate-Limiting, and Data Storage
Once you can load a protected page, the rest of a scraping pipeline is standard Python:
import asyncio
import csv
from damru import AsyncDamru

async def scrape_all_pages(base_url: str, total_pages: int):
    async with AsyncDamru() as browser:
        all_items = []
        for page_num in range(1, total_pages + 1):
            page = await browser.new_page()
            await page.goto(f"{base_url}?page={page_num}", wait_until="networkidle")
            items = await page.locator(".item").all_inner_texts()
            all_items.extend(items)
            await page.close()
            await asyncio.sleep(1.5)  # polite crawl delay
        return all_items

rows = asyncio.run(scrape_all_pages("https://example.com/catalog", total_pages=5))
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows([[r] for r in rows])
A polite crawl delay (asyncio.sleep) between pages reduces server load and avoids triggering rate-limit responses — good practice regardless of the scraping tool.
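A fixed delay can be extended with jitter and with backoff when the server does signal rate limiting. The following is a minimal sketch under stated assumptions: `backoff_delay`, `jittered`, and `fetch_with_retry` are hypothetical helpers, the `fetch` argument is assumed to be any async callable returning a `(status, body)` tuple, and treating 429 and 503 as retryable is a common convention rather than a rule from this guide.

```python
import asyncio
import random


def backoff_delay(attempt: int, base: float = 1.5, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2**attempt, capped at cap seconds."""
    return min(cap, base * (2 ** attempt))


def jittered(delay: float, spread: float = 0.3) -> float:
    """Randomize delay by +/- spread (a fraction) so requests are not perfectly periodic."""
    return delay * random.uniform(1 - spread, 1 + spread)


async def fetch_with_retry(fetch, url: str, max_attempts: int = 4, base: float = 1.5):
    """Call the async fetch(url) -> (status, body); retry on 429 or 503 with backoff."""
    for attempt in range(max_attempts):
        status, body = await fetch(url)
        if status not in (429, 503):
            return status, body
        # Sleep before the next attempt, but not after the final one.
        if attempt < max_attempts - 1:
            await asyncio.sleep(jittered(backoff_delay(attempt, base)))
    return status, body
```

With the default `base` of 1.5 seconds, the un-jittered waits grow as 1.5, 3, 6 seconds and so on, which backs off quickly when a site pushes back while keeping the normal case fast.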
Ethical and Legal Considerations for Python Web Scraping
Python and web scraping intersect with a real legal and ethical landscape. Before scraping any site:
- Check `robots.txt`: the `Disallow` directives indicate which paths the site owner asks crawlers to skip.
- Review the Terms of Service: many sites prohibit automated access to data.
- Avoid personal data: scraping personal information without a lawful basis raises GDPR, CCPA, and other privacy law concerns.
- Rate-limit your requests: aggressive crawl rates can have a denial-of-service effect on the target server.
- Consult applicable law: the CFAA (US), Computer Misuse Act (UK), and equivalent statutes in your jurisdiction are relevant.
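The robots.txt check in the list above can be automated with the standard library's `urllib.robotparser`. In this sketch the robots.txt content, the `research-bot` user agent, and the `allowed` helper are hypothetical. In practice you would load the file from the site root, for example with `RobotFileParser.set_url()` followed by `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "research-bot"  # assumed crawler name


def allowed(url: str) -> bool:
    """True when the parsed rules permit USER_AGENT to fetch url."""
    return parser.can_fetch(USER_AGENT, url)


print(allowed("https://example.com/products"))
print(allowed("https://example.com/private/data"))
print(parser.crawl_delay(USER_AGENT))
```

Passing this check covers only the robots.txt item; the Terms of Service, privacy, and legal points still need to be reviewed separately.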
Damru is designed for legitimate use cases: research, QA testing, competitive price monitoring on publicly available data, and mobile-browser compatibility analysis. It is not intended for unauthorized access, credential attacks, or bypassing authentication systems.
When to Use Which Python Scraping Tool
- Static site, no bot protection → `requests` + `BeautifulSoup`
- Async static scraping → `httpx` + `parsel`
- JavaScript-rendered, basic protection → `Playwright` (desktop headless)
- JavaScript-rendered, medium protection → `SeleniumBase` UC mode
- Bot-protected, mobile fingerprint required, or Android-specific rendering → Damru
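The same decision list can be written as a small function, which is handy when a pipeline has to pick a backend per target. This is a sketch of the list's logic only: `choose_tool`, its parameters, and the protection labels `"none"`, `"basic"`, `"medium"`, and `"high"` are assumptions made for illustration.

```python
def choose_tool(
    js_rendered: bool,
    protection: str = "none",
    needs_mobile: bool = False,
    use_async: bool = False,
) -> str:
    """Map target-site traits to a tool from the decision list.

    protection is one of "none", "basic", "medium", "high" (labels assumed here).
    """
    levels = ("none", "basic", "medium", "high")
    if protection not in levels:
        raise ValueError(f"protection must be one of {levels}")
    if needs_mobile or protection == "high":
        return "Damru"
    if js_rendered and protection == "medium":
        return "SeleniumBase UC mode"
    if js_rendered:
        return "Playwright (desktop headless)"
    # Static pages: the list does not cover protected static sites,
    # so this sketch falls back to the plain HTTP clients.
    return "httpx + parsel" if use_async else "requests + BeautifulSoup"


print(choose_tool(js_rendered=True, protection="medium"))
```

Starting with the cheapest tool that fits and escalating only when it fails keeps resource use low, since each step up the list costs more to run.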
For deeper context on stealth and antidetect tooling:
- Playwright Stealth Alternative — Damru vs playwright-stealth
- Free Open-Source Antidetect Browser Comparison
- Damru on GitHub
Frequently Asked Questions
What is the best Python library for web scraping in 2026?
It depends on the site. For static HTML, requests + BeautifulSoup is fastest. For JavaScript-rendered pages, Playwright leads the field. For bot-protected or mobile-rendered targets where fingerprinting defeats desktop browsers, Damru provides genuine Android Chrome automation through the same Playwright Python API.
How do I scrape JavaScript-rendered pages with Python?
Use a headless browser library, such as Playwright or Selenium, that executes JavaScript and waits for dynamic content to load. For sites that block headless desktop browsers, Damru runs real Android Chrome via Redroid, bypassing desktop-browser fingerprint detection at the TLS and browser-engine layers.
What makes Damru more effective on bot-protected sites?
Damru runs genuine Android Chrome inside a Docker container, not a desktop browser with patched JavaScript APIs. Every signal a bot detector reads (TLS fingerprint, User-Agent, WebGL renderer, sensor availability) comes from a real Android environment. There is no patch layer for detectors to identify.
Does Damru support Python async and concurrent scraping?
Yes. AsyncDamru is an async context manager built on asyncio and async_playwright. Multiple new_page() calls within a single AsyncDamru session support concurrent tab scraping.
Is Damru free to use in commercial projects?
Damru is released under the PolyForm Noncommercial 1.0.0 license, which is free for personal, educational, and noncommercial use; commercial use (including paid scraping, paid automation, or SaaS) requires a separate commercial license. Always verify that your specific scraping use case complies with the target site’s Terms of Service and applicable law in your jurisdiction.
Related
- Survey every option in the Python web scraping libraries comparison and the broader field of Python browser automation.
- When one machine is not enough, move to web scraping at scale with device pooling.
- Understand the TLS fingerprinting checks that defeat plain HTTP clients.
- Install Damru with `pip` to handle bot-protected pages, and manage the worker pool from the instance manager.