If you've been following this series, you already know the full picture.
We started with TLS fingerprinting — the fact that Python's HTTP stack sends a ClientHello that looks nothing like Chrome, and that alone is enough to get blocked before a single line of page JavaScript ever runs.
Then Canvas fingerprinting — why randomizing the canvas output on every call is actually worse than doing nothing, and why real browsers produce stable, hardware-bound hashes that your bot needs to replicate correctly.
Then behavioral fingerprinting — mouse trajectories, keystroke timing distributions, scroll inertia, and why fixing TLS and canvas still isn't enough if your mouse teleports to coordinates in zero milliseconds.
Each article covered one layer of the same problem I was actively solving in code. Phantomime is what that code became.
## The Problem With Piecemeal Solutions
There are existing tools that patch parts of this. Playwright-stealth disables navigator.webdriver. Some scripts patch the canvas. Others spoof the User-Agent.
The issue is that detection systems don't check one signal — they check dozens simultaneously and look for consistency across them. A User-Agent claiming Windows with a Linux navigator.platform. A canvas fingerprint that changes on every call. A WebGL renderer string that doesn't match any real GPU. A mouse that moves in a perfect straight line at 60fps.
Each individual patch makes things marginally better. Without coherence across all of them, you're still obviously a bot — just a slightly better-dressed one.
Phantomime was built around a single principle: every layer must be solved together, and they must be internally consistent.
## The Four Layers
### Layer 1 — TLS (the one most people forget)
As covered in part one of this series, the TLS ClientHello is the first thing a server sees. Playwright uses Chromium's network stack, which is fine — but any direct Python HTTP call (for efficiency, for API access, for anything) exposes Python's TLS fingerprint.
Phantomime solves this with curl-cffi, which impersonates Chrome 124's TLS stack at the socket level. After authenticating through the browser, you can switch to direct HTTP calls without leaking your stack:
```python
# Authenticate via the humanized browser
await browser.goto("https://target.com/login")
await browser.type_text("#email", "user@example.com")
await browser.type_text("#password", "secret")
await browser.click("#submit")
await browser.wait_for(".dashboard")

# Export session to curl-cffi — same cookies, Chrome TLS fingerprint
await browser.sync_cookies_to_session("https://target.com")

# Direct HTTP calls — 10-50x faster than browser navigation
for item_id in item_ids:
    resp = await browser.fetch(f"https://target.com/api/items/{item_id}")
    process(resp.json())
```
### Layer 2 — Browser Fingerprint
This is where most patching libraries stop, and where most of them get the details wrong.
As explained in part two, the key insight about canvas fingerprinting is stability. Real browsers produce the same canvas hash on every call on the same hardware. A bot that randomizes per-call is just as detectable as one that blocks the canvas entirely — the instability itself is the signal.
Phantomime uses a Linear Congruential Generator seeded from an MD5 hash of the profile directory name. The noise is stable for the entire session, and different across sessions — exactly what real hardware produces:
```
profile_dir name → MD5 → LCG seed → fixed noise sequence for this session
```
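The scheme is easy to sketch in plain Python. The snippet below is an illustrative reconstruction, not Phantomime's actual code: the LCG constants (the classic Numerical Recipes pair) and the function name are my own choices.

```python
import hashlib

def profile_noise(profile_dir: str, n: int = 5) -> list[float]:
    """Derive a stable per-session noise sequence from the profile directory name."""
    # MD5 of the directory name gives a stable digest; take 32 bits as the LCG seed
    state = int(hashlib.md5(profile_dir.encode()).hexdigest()[:8], 16)
    out = []
    for _ in range(n):
        # Linear Congruential Generator step (Numerical Recipes constants, illustrative)
        state = (1664525 * state + 1013904223) % 2**32
        out.append(state / 2**32)  # uniform float in [0, 1)
    return out

# Same directory name always yields the same sequence; a different name yields a
# different one. That is the "stable within a session, distinct across sessions"
# property real hardware exhibits.
print(profile_noise("./profiles/session_01"))
```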
The same principle applies to WebGL (readPixels output), AudioContext (getChannelData), and getBoundingClientRect (used for font enumeration via measureText).
Beyond noise, every surface property is derived from a single hardware profile to ensure coherence:
| Property | Source |
|---|---|
| User-Agent | Profile |
| Sec-CH-UA, Sec-CH-UA-Platform | Derived from profile UA |
| navigator.platform | Profile OS (Win32, MacIntel, Linux x86_64) |
| navigator.deviceMemory | Profile (4 / 8 / 16 GB) |
| navigator.hardwareConcurrency | Profile (4 / 8 cores) |
| screen.width/height | Profile resolution |
| window.devicePixelRatio | Profile DPR |
| WebGL vendor/renderer | Profile GPU string |
No mismatched properties. No "Intel GPU" claiming to be an RTX 4090.
One detail worth mentioning: headless=True in Playwright activates --headless=old (pipe mode), which disables the GPU pipeline and makes WebGL/Canvas outputs immediately distinguishable. Phantomime always passes headless=False to Playwright and injects --headless=new as a launch argument — the browser is headless, but the GPU pipeline is intact.
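The flag juggling can be sketched as a small kwargs builder (illustrative only; Phantomime's real option handling may differ):

```python
def launch_kwargs(headless: bool) -> dict:
    """Build Playwright launch kwargs that request Chromium's new headless mode.

    Passing headless=False to Playwright avoids its legacy pipe mode; the
    --headless=new flag then makes Chromium itself run headless with the
    GPU pipeline intact.
    """
    kwargs = {"headless": False, "args": []}
    if headless:
        kwargs["args"].append("--headless=new")
    return kwargs

print(launch_kwargs(True))
```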
### Layer 3 — Behavioral Signals
Covered in detail in part three. The short version: real users don't move their mouse in straight lines, don't type at perfectly uniform intervals, and don't scroll in instant jumps.
Mouse movement uses cubic Bézier trajectories with Fitts' Law velocity modulation — movement time scales with distance and target size, just like human hand movements. 30% of clicks include an overshoot followed by a correction micro-movement.
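To make that concrete, here is a minimal sketch of Fitts'-Law timing plus a cubic Bézier path. The constants (a, b, the control-point jitter range) are illustrative, not Phantomime's tuned values:

```python
import math
import random

def fitts_time(distance: float, width: float, a: float = 0.05, b: float = 0.12) -> float:
    """Fitts' Law: movement time grows with log2(D/W + 1).
    Far or small targets take longer, just like a human hand."""
    return a + b * math.log2(distance / width + 1)

def bezier_path(start, end, steps: int = 30):
    """Cubic Bézier from start to end, with two randomly offset control
    points so the path bows away from the straight line."""
    (x0, y0), (x3, y3) = start, end
    # Control points sit along the line, pushed sideways by random jitter
    cx1 = x0 + (x3 - x0) * 0.3 + random.uniform(-60, 60)
    cy1 = y0 + (y3 - y0) * 0.3 + random.uniform(-60, 60)
    cx2 = x0 + (x3 - x0) * 0.7 + random.uniform(-60, 60)
    cy2 = y0 + (y3 - y0) * 0.7 + random.uniform(-60, 60)
    pts = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        x = u**3 * x0 + 3 * u**2 * t * cx1 + 3 * u * t**2 * cx2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * cy1 + 3 * u * t**2 * cy2 + t**3 * y3
        pts.append((x, y))
    return pts

path = bezier_path((100, 100), (800, 500))
duration = fitts_time(math.dist((100, 100), (800, 500)), width=40)
```

The intermediate points get dispatched as mousemove events spread over `duration`; the curve starts and ends exactly on the requested coordinates while wandering in between.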
Typing uses a log-normal inter-keystroke delay distribution (not uniform random). A configurable typo_rate (default 4%) injects QWERTY-neighbor errors with autocorrection. A frustration_rate (default 1%) simulates over-deletion — backspacing one character too many and retyping.
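A sketch of those two typing mechanics, with an abbreviated neighbor map and illustrative constants (not Phantomime's internals):

```python
import math
import random

# Tiny excerpt of a QWERTY adjacency map; a real one covers the whole keyboard
QWERTY_NEIGHBORS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr", "t": "rfgy",
}

def keystroke_delay(median_ms: float = 120.0, sigma: float = 0.35) -> float:
    """Log-normal inter-keystroke gap: values cluster near the median with a
    long right tail of occasional hesitations, unlike uniform random delays."""
    return random.lognormvariate(math.log(median_ms), sigma)

def maybe_typo(ch: str, typo_rate: float = 0.04) -> str:
    """With probability typo_rate, hit an adjacent QWERTY key instead.
    The caller then backspaces and retypes, mimicking self-correction."""
    if ch in QWERTY_NEIGHBORS and random.random() < typo_rate:
        return random.choice(QWERTY_NEIGHBORS[ch])
    return ch
```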
Scroll uses inertial easing with intermediate mousemove events dispatched during the scroll — matching trackpad behavior.
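The easing itself is simple. Here is an illustrative cubic ease-out, not Phantomime's exact curve:

```python
def scroll_offsets(total: float, steps: int = 20) -> list[float]:
    """Cumulative scroll positions under cubic ease-out: large steps at
    first, tapering off toward the target like trackpad inertia."""
    offsets = []
    for i in range(1, steps + 1):
        t = i / steps
        eased = 1 - (1 - t) ** 3  # fast start, slow finish
        offsets.append(total * eased)
    return offsets

# Each offset would be dispatched as one wheel step, with synthetic
# mousemove events fired between steps to match trackpad behavior.
print(scroll_offsets(600, steps=5))
```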
Idle periods include micro-movements, occasional scroll pulses, and randomized pauses drawn from an exponential distribution. warmup() runs a full idle cycle before the first navigation to age the session.
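The pause model can be sketched with the standard library's exponential draw (again illustrative, not the library's actual code):

```python
import random

def idle_pauses(total_s: float, rate: float = 0.5) -> list[float]:
    """Draw pauses from an exponential distribution (mean 1/rate seconds)
    until they fill at least total_s. Between pauses, a real implementation
    would emit micro mouse movements or a small scroll pulse."""
    pauses, elapsed = [], 0.0
    while elapsed < total_s:
        p = random.expovariate(rate)
        pauses.append(p)
        elapsed += p
    return pauses
```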
One patch that often gets overlooked: Event.isTrusted. Playwright dispatches synthetic events with isTrusted: false by default — a reliable signal for detection systems listening to event properties. Phantomime patches this to return true for all synthetic events.
### Layer 4 — Function.prototype.toString Spoofing
When you patch navigator.webdriver or HTMLCanvasElement.prototype.toDataURL, detection scripts can call .toString() on those functions and check whether they return "function toDataURL() { [native code] }" or your custom JS. If they return your custom code, you're caught.
Phantomime patches Function.prototype.toString after all other patches are in place, so every patched function appears native to JS-level inspection.
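The shape of such an init script looks roughly like this. The snippet is my sketch of the technique, including the hypothetical registration helper `__markPatched`, not Phantomime's source:

```python
# JavaScript injected before any page script runs (sketch, not actual source).
# Earlier patches register their replacement functions; the spoofed toString
# then answers "[native code]" for them, and for itself.
TOSTRING_PATCH = """
(() => {
  const patchedFns = new WeakSet();
  // Hypothetical helper: each fingerprint patch registers its replacement here
  window.__markPatched = (fn) => patchedFns.add(fn);

  const nativeToString = Function.prototype.toString;
  const fakeToString = function toString() {
    // Patched functions, and this spoof itself, claim to be native code
    if (patchedFns.has(this) || this === fakeToString) {
      return `function ${this.name || ''}() { [native code] }`;
    }
    return nativeToString.call(this);
  };
  patchedFns.add(fakeToString);
  Function.prototype.toString = fakeToString;
})();
"""

# With Playwright, this would be registered before page scripts execute:
# await context.add_init_script(TOSTRING_PATCH)
```

The order matters: this must run after every other patch, or a late patch would replace a function the WeakSet never saw.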
## Concurrency
For volume scraping, run_swarm launches N browsers in parallel, each with its own profile directory and therefore its own distinct fingerprint:
```python
from phantomime import HumanBrowser, run_swarm

async def scrape(browser: HumanBrowser, item: dict) -> dict:
    await browser.goto(item["url"])
    await browser.wait_for(".product-detail")
    return {
        "id": item["id"],
        "title": await browser.get_text("h1"),
        "price": await browser.get_text(".price"),
    }

results = await run_swarm(
    task=scrape,
    items=product_list,    # list of dicts
    max_concurrent=8,      # tune to available RAM (~350MB per instance)
    browser_kwargs={"headless": True, "locale": "en-US"},
    profile_base_dir="./profiles",
)
```
Each worker gets ./profiles/worker_0, ./profiles/worker_1, etc. — distinct directory names mean distinct LCG seeds mean distinct fingerprints. Ten workers running in parallel look like ten different machines to the target site.
## Installing and Getting Started
```bash
pip install phantomime
playwright install chromium

# Optional but recommended — enables the TLS layer
pip install curl-cffi
```
Basic usage:
```python
import asyncio
from phantomime import HumanBrowser

async def main():
    async with HumanBrowser(
        profile_dir="./profiles/session_01",
        headless=True,
        locale="en-US",
        timezone="America/New_York",
    ) as browser:
        await browser.warmup(duration_s=4.0)
        await browser.goto("https://example.com")
        await browser.type_text("#search", "python automation")
        await browser.click("button[type=submit]")
        await browser.wait_for(".results")
        print(await browser.get_text(".results"))

asyncio.run(main())
```
The profile directory is persistent — cookies, localStorage, and cache survive across runs. Run once to log in manually, and every subsequent run reuses the saved session.
## What It Doesn't Do

- **CAPTCHA solving** — out of scope. Integrate a third-party solver (2captcha, CapMonster) and inject the token via browser.evaluate().
- **IP reputation** — if your IP is in a datacenter range or on a known proxy list, no amount of fingerprint patching helps. Use residential proxies for targets that maintain IP reputation lists.
- **Cloudflare Turnstile** — the interactive checkbox requires a solver. The JS challenge (the spinning wheel) resolves fine with idle(duration_s=8.0).
## Links
- PyPI: pypi.org/project/phantomime
- GitHub: github.com/Tanwydd/phantomime
This series started as a way to document what I was learning while building something real. If any of the previous articles helped you understand why your scraper was getting blocked, this is where all of it ends up.
Questions, issues, and PRs welcome.