Web Scraping with Python in 2026: Tools, Tutorial & Real Android Chrome for Hard Sites

Web scraping with Python means programmatically extracting structured data from websites using libraries ranging from requests and BeautifulSoup for static pages, up to Damru running genuine Android Chrome for bot-protected, mobile-rendered targets.

Python is the dominant language for web data scraping — and for good reason. Readable syntax, a deep ecosystem of parsing and browser-automation libraries, and strong async support with asyncio make it the natural choice for everything from a quick one-off data pull to a production-grade scraping pipeline. The key is choosing the right tool for the target site’s complexity, then knowing precisely when to upgrade.

This guide moves through the Python scraping stack in order of complexity, then walks through a complete code example using Damru, the source-available Android-native framework for sites that defeat conventional scrapers.

Why Python and Web Scraping Go Hand-in-Hand

Python’s web scraping ecosystem is unusually mature. requests makes HTTP a three-liner. lxml and BeautifulSoup parse messy HTML without fuss. Playwright and Selenium drive real browsers for single-page applications. asyncio lets a single process scrape dozens of pages concurrently. And at the advanced end, Damru connects Playwright to a real Android Chrome instance for fingerprint-sensitive targets, bridging the gap between browser automation and genuine mobile browsing.

Few other languages offer this full spectrum in a single, coherent ecosystem.

Python Web Data Scraping Tool Comparison

Tool	Best For	JS Rendering	Bot-Protection Resistance	Mobile Fingerprint	Open Source
requests + BeautifulSoup	Static HTML pages, RSS, APIs	No	Low	No	Yes
httpx + parsel	Async static scraping	No	Low	No	Yes
Playwright (desktop Chromium)	JS-rendered SPAs, dynamic pages	Yes	Medium	No (desktop UA)	Yes
Selenium + undetected-chromedriver	JS pages, basic evasion	Yes	Medium	No	Yes
SeleniumBase (UC mode)	QA + medium-difficulty evasion	Yes	Medium-High	No	Yes
Damru	Bot-protected, mobile-rendered sites	Yes (real Android Chrome)	High (authentic Android fingerprint)	Yes (real device)	Source-available (PolyForm Noncommercial)

Step 1 — Static Pages: requests + BeautifulSoup

For pages that deliver all content in the initial HTML response, requests plus BeautifulSoup is the fastest, most portable approach:

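import requests
from bs4 import BeautifulSoup

response = requests.get(
    "https://example.com/products",
    headers={"User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0)"},
    timeout=10,  # requests has no default timeout; a stalled server would hang the script
)
response.raise_for_status()  # fail fast on 4xx/5xx instead of parsing an error page

# The "lxml" parser needs the lxml package; "html.parser" is the stdlib alternative.
soup = BeautifulSoup(response.text, "lxml")
titles = [tag.get_text(strip=True) for tag in soup.select("h2.product-title")]
print(titles)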

This approach fails on sites that render content client-side via JavaScript frameworks (React, Vue, Angular), return CAPTCHA challenges to non-browser User-Agents, or use TLS fingerprinting to reject non-browser HTTP clients.
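
A quick way to spot the first of these failure modes is to check whether the elements you expect exist in the raw HTML at all. The sketch below is illustrative only: it uses the standard-library html.parser instead of BeautifulSoup so it runs without extra dependencies, and the names ClassCounter and static_html_has are hypothetical helpers, not part of any library.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of a given name that carry a given CSS class."""

    def __init__(self, tag: str, css_class: str) -> None:
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def static_html_has(html: str, tag: str, css_class: str) -> bool:
    """Return True if the raw HTML already contains the target elements."""
    counter = ClassCounter(tag, css_class)
    counter.feed(html)
    counter.close()
    return counter.count > 0


server_rendered = '<h2 class="product-title">Kettle</h2>'
spa_shell = '<div id="root"></div><script src="/app.js"></script>'

print(static_html_has(server_rendered, "h2", "product-title"))  # True
print(static_html_has(spa_shell, "h2", "product-title"))        # False
```

If the check comes back False for a page that clearly shows the data in a browser, the content is probably being rendered client-side and a browser engine is the next step.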

Step 2 — JavaScript-Rendered Pages: Playwright

When a page loads its data after the initial HTML via JavaScript, you need a real browser engine:

import asyncio
from playwright.async_api import async_playwright

async def scrape_js_page():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com/products")
        await page.wait_for_selector("h2.product-title")
        titles = await page.locator("h2.product-title").all_inner_texts()
        await browser.close()
        return titles

print(asyncio.run(scrape_js_page()))

Desktop Playwright handles most JavaScript rendering challenges. However, sites protected by Cloudflare Bot Management, DataDome, PerimeterX, Kasada, or Akamai Bot Manager can detect signals such as the HeadlessChrome browser signature, the desktop TLS fingerprint, and the absence of expected mobile sensor APIs, and respond with blocks or CAPTCHA challenges.
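
The first of those signals is easy to inspect yourself. As a rough sketch, assuming a detector that looked only at the User-Agent string (real bot-management products combine many more signals), a check for the HeadlessChrome token might look like this. The function name and both sample strings are illustrative.

```python
def looks_headless(user_agent: str) -> bool:
    """Flag User-Agent strings that advertise headless Chrome."""
    return "headlesschrome" in user_agent.lower()


# Illustrative sample strings, not captured from a specific browser build.
headless_ua = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36"
)
android_ua = (
    "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36"
)

print(looks_headless(headless_ua))  # True
print(looks_headless(android_ua))   # False
```

Inside a Playwright session, `await page.evaluate("navigator.userAgent")` returns the string your browser actually reports, which you can pass to the same function.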

Step 3 — Bot-Protected Sites: Damru with Real Android Chrome

For bot-protected, mobile-rendered pages, Damru runs genuine Android Chrome via Redroid, so every request carries an authentic mobile fingerprint rather than a spoofed one, which makes it much harder for advanced detectors to tell apart from a real device.

Install

pip install damru

Complete Code Walkthrough

import asyncio
from damru import AsyncDamru

async def scrape_protected_site(url: str) -> list[dict]:
    async with AsyncDamru() as browser:
        # 1. AsyncDamru starts Redroid (Android-in-Docker) and
        #    launches real Chrome for Android inside the container.
        #    CDP is wired up automatically — no manual setup needed.

        page = await browser.new_page()

        # 2. page.goto() issues the request from genuine Android Chrome.
        #    TLS handshake, User-Agent, Accept-Language, sec-ch-ua headers
        #    all originate from a real Android build, not a spoofed desktop.
        await page.goto(url, wait_until="networkidle")

        # 3. Standard Playwright selectors work identically.
        #    Damru is transparent at the API layer — same syntax,
        #    different browser underneath.
        await page.wait_for_selector(".product-card", timeout=15_000)

        cards = page.locator(".product-card")
        count = await cards.count()

        results = []
        for i in range(count):
            card = cards.nth(i)
            name  = await card.locator(".name").inner_text()
            price = await card.locator(".price").inner_text()
            results.append({"name": name.strip(), "price": price.strip()})

        # 4. Exiting the context manager stops the Redroid container cleanly.
        return results


data = asyncio.run(
    scrape_protected_site("https://bot-protected-example.com/products")
)
for row in data:
    print(row)

Line-by-Line Explanation

Code	What happens
`async with AsyncDamru()`	Starts the Redroid Docker container; launches Android Chrome; establishes CDP.
`browser.new_page()`	Opens a tab in real Android Chrome — not a desktop browser with patched APIs.
`page.goto(url)`	Request sent with authentic Android Chrome TLS fingerprint, real `sec-ch-ua`, genuine mobile UA string.
`page.wait_for_selector(...)`	Standard Playwright selector wait — identical to desktop Playwright usage.
`card.locator(...).inner_text()`	Data extraction using Playwright locators — no change in syntax from standard usage.
Context manager `__aexit__`	Container and Chrome instance stop cleanly; no orphaned Docker processes.

The full API reference, host prerequisites (Linux host, KVM support, Docker), and multi-page concurrency examples are available at github.com/akwin1234/damru.
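
For concurrent tabs, a bounded fan-out is usually safer than opening every URL at once. The following is a minimal sketch, not Damru's documented pattern: it assumes only that the browser object exposes the Playwright-style new_page(), goto(), title(), and close() coroutines, and the helper name scrape_titles and the max_tabs limit are hypothetical.

```python
import asyncio


async def scrape_titles(browser, urls: list[str], max_tabs: int = 3) -> list[str]:
    """Fetch page titles concurrently, with at most max_tabs tabs open at once."""
    gate = asyncio.Semaphore(max_tabs)

    async def fetch_title(url: str) -> str:
        async with gate:                 # wait for a free tab slot
            page = await browser.new_page()
            try:
                await page.goto(url)
                return await page.title()
            finally:
                await page.close()       # always release the tab

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(fetch_title(u) for u in urls))
```

Within an `async with AsyncDamru() as browser:` block, this would be called as `await scrape_titles(browser, urls)`. The semaphore keeps memory use in the Android container predictable and spreads requests out a little.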

Handling Pagination, Rate-Limiting, and Data Storage

Once you can load a protected page, the rest of a scraping pipeline is standard Python:

import asyncio
import csv
from damru import AsyncDamru

async def scrape_all_pages(base_url: str, total_pages: int):
    async with AsyncDamru() as browser:
        all_items = []
        for page_num in range(1, total_pages + 1):
            page = await browser.new_page()
            await page.goto(f"{base_url}?page={page_num}", wait_until="networkidle")
            items = await page.locator(".item").all_inner_texts()
            all_items.extend(items)
            await page.close()
            await asyncio.sleep(1.5)  # polite crawl delay
        return all_items

rows = asyncio.run(scrape_all_pages("https://example.com/catalog", total_pages=5))
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows([[r] for r in rows])

A polite crawl delay (asyncio.sleep) between pages reduces server load and avoids triggering rate-limit responses — good practice regardless of the scraping tool.
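
A fixed delay covers the polite case. When a server does answer with a rate-limit response, a common complement is exponential backoff with jitter. The sketch below shows one possible shape for it under stated assumptions: fetch is any coroutine function you supply, RateLimited is a hypothetical exception your fetch raises on an HTTP 429, and the base, cap, and attempt counts are arbitrary defaults.

```python
import asyncio
import random


class RateLimited(Exception):
    """Hypothetical signal that the server asked us to slow down (e.g. HTTP 429)."""


def backoff_delay(attempt: int, base: float = 1.5, cap: float = 60.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


async def fetch_with_retry(fetch, url: str, max_attempts: int = 4, base: float = 1.5):
    """Call fetch(url), sleeping with growing jittered delays after RateLimited."""
    for attempt in range(max_attempts):
        try:
            return await fetch(url)
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller decide what to do
            await asyncio.sleep(backoff_delay(attempt, base=base))
```

The jitter matters when several workers hit the same site: without it, they would all retry at the same moment and trigger the limit again.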

Ethical and Legal Considerations for Python Web Scraping

Python and web scraping intersect with a real legal and ethical landscape. Before scraping any site:

Check robots.txt: the Disallow directives indicate which paths the site owner asks crawlers to skip.
Review the Terms of Service: many sites prohibit automated access to data.
Avoid personal data: scraping personal information without a lawful basis raises GDPR, CCPA, and other privacy law concerns.
Rate-limit your requests: aggressive crawl rates can degrade the target server in the same way a denial-of-service attack would.
Consult applicable law: the CFAA (US), Computer Misuse Act (UK), and equivalent statutes in your jurisdiction are relevant.
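
The robots.txt check from the list above can be automated with the standard library. This is a small sketch: the rules and the research-bot agent name are made-up examples, and parse() is fed an inline string so the snippet runs offline. Against a live site you would call set_url() and read() instead.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real file lives at https://<host>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(parser.can_fetch("research-bot", "https://example.com/products"))   # True
print(parser.can_fetch("research-bot", "https://example.com/private/x"))  # False
print(parser.crawl_delay("research-bot"))                                  # 2
```

When a Crawl-delay is present, it is a reasonable lower bound for the sleep between page loads.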

Damru is designed for legitimate use cases: research, QA testing, competitive price monitoring on publicly available data, and mobile-browser compatibility analysis. It is not intended for unauthorized access, credential attacks, or bypassing authentication systems.

When to Use Which Python Scraping Tool

Static site, no bot protection → requests + BeautifulSoup
Async static scraping → httpx + parsel
JavaScript-rendered, basic protection → Playwright (desktop headless)
JavaScript-rendered, medium protection → SeleniumBase UC mode
Bot-protected, mobile fingerprint required, or Android-specific rendering → Damru
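
The same decision list can be written as a small function, which is handy when one pipeline covers many targets. This is only a sketch of the list above: the function name, the parameter names, and the protection labels "none", "basic", "medium", and "high" are chosen for illustration, and combinations the list does not mention fall through to the nearest entry, which is an assumption on my part.

```python
def choose_tool(js_rendered: bool, protection: str = "none",
                needs_mobile: bool = False, use_async: bool = False) -> str:
    """Map a target site's traits to the tool suggested in the decision list."""
    if protection not in {"none", "basic", "medium", "high"}:
        raise ValueError(f"unknown protection level: {protection!r}")
    # Heavy bot protection or a required mobile fingerprint overrides everything else.
    if needs_mobile or protection == "high":
        return "Damru"
    if js_rendered:
        if protection == "medium":
            return "SeleniumBase UC mode"
        return "Playwright (desktop headless)"
    # Static HTML: pick the async or sync HTTP stack.
    return "httpx + parsel" if use_async else "requests + BeautifulSoup"


print(choose_tool(js_rendered=False))                       # requests + BeautifulSoup
print(choose_tool(js_rendered=True, protection="basic"))    # Playwright (desktop headless)
print(choose_tool(js_rendered=True, protection="high"))     # Damru
```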

Frequently Asked Questions

What is the best Python library for web scraping in 2026? It depends on the site. For static HTML, requests + BeautifulSoup is fastest. For JavaScript-rendered pages, Playwright leads the field. For bot-protected or mobile-rendered targets where fingerprinting defeats desktop browsers, Damru provides genuine Android Chrome automation through the same Playwright Python API.

How do I scrape JavaScript-rendered pages with Python? Use a headless browser library — Playwright or Selenium — that executes JavaScript and waits for dynamic content to load. For sites that block headless desktop browsers, Damru runs real Android Chrome via Redroid, bypassing desktop-browser fingerprint detection at the TLS and browser-engine layers.

What makes Damru more effective on bot-protected sites? Damru runs genuine Android Chrome inside a Docker container, not a desktop browser with patched JavaScript APIs. Every signal a bot detector reads — TLS fingerprint, User-Agent, WebGL renderer, sensor availability — comes from a real Android environment. There is no patch layer for detectors to identify.

Does Damru support Python async and concurrent scraping? Yes. AsyncDamru is an async context manager built on asyncio and async_playwright. Multiple new_page() calls within a single AsyncDamru session support concurrent tab scraping.

Is Damru free to use in commercial projects? Damru is released under the PolyForm Noncommercial 1.0.0 license, which is free for personal, educational, and noncommercial use; commercial use — including paid scraping, paid automation, or SaaS — requires a separate commercial license. Always verify that your specific scraping use case complies with the target site’s Terms of Service and applicable law in your jurisdiction.

Survey every option in the Python web scraping libraries comparison and the broader field of Python browser automation.
When one machine is not enough, move to web scraping at scale with device pooling.
Understand the TLS fingerprinting checks that defeat plain HTTP clients.
Install Damru with pip to handle bot-protected pages, and manage the worker pool from the instance manager.