Browser Sessions: Stateful Web Automation Behind a CDP Connection

Most web automation breaks the moment a site asks you to log in. Your scraper fetches the login page, posts credentials, and then... loses the session cookie on the next request because it spun up a fresh context. Or it hits a JS-rendered dashboard and gets back an empty <div id="app"></div>.

The root problem is statefulness. HTTP is stateless by design, and most scraping tools treat each request as isolated. But real user flows are not isolated. Logging in, clicking through a multi-step form, waiting for a WebSocket to push data, then reading the result: that is a single continuous session, and automating it requires a browser that remembers where it has been.

This is where Chrome DevTools Protocol (CDP) sessions come in.

What CDP Actually Gives You

CDP is the protocol Chrome and Chromium expose for external control. Tools like Playwright and Puppeteer sit on top of it. When you connect to a CDP endpoint, you get a persistent channel to a running browser instance. You can send commands (Page.navigate, Input.dispatchMouseEvent, Runtime.evaluate), subscribe to events (Network.responseReceived, Page.loadEventFired), and read the DOM at any point.
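
To make that concrete, here is a minimal sketch using Playwright's CDP session API against a locally launched Chromium (a stand-in for any CDP endpoint, hosted or not). It sends a raw Page.navigate command and subscribes to Network.responseReceived, the same primitives the higher-level goto and request listeners are built on:

import asyncio
from playwright.async_api import async_playwright

async def raw_cdp_demo():
    async with async_playwright() as p:
        # A locally launched Chromium stands in for any CDP endpoint here.
        browser = await p.chromium.launch()
        context = await browser.new_context()
        page = await context.new_page()

        # Open a raw CDP channel scoped to this page.
        cdp = await context.new_cdp_session(page)

        # Subscribe to an event: fires for every response the page receives.
        await cdp.send("Network.enable")
        cdp.on(
            "Network.responseReceived",
            lambda params: print(params["response"]["status"], params["response"]["url"]),
        )

        # Send a command directly instead of calling the Playwright wrapper.
        await cdp.send("Page.navigate", {"url": "https://example.com"})
        await page.wait_for_load_state("load")

        await browser.close()

asyncio.run(raw_cdp_demo())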

The key word is persistent. Unlike a one-shot headless request, a CDP session keeps the browser alive between your commands. Cookies stay. LocalStorage stays. Auth tokens stay. If the site uses a session cookie set after login, your next navigation carries that cookie automatically because it is in the same browser profile.

Anakin's Browser Sessions API exposes exactly this: a hosted, stateful CDP connection you can attach to without managing your own Chromium fleet. You get a wsEndpoint URL, connect with Playwright or Puppeteer, and the browser persists across your script.

The Login-Then-Scrape Pattern

Here is a concrete example. Say you need data from a SaaS dashboard that sits behind an email/password login. The dashboard renders client-side with React, so the data never appears in the initial HTML. You need to:

  1. Navigate to the login page.
  2. Fill and submit the form.
  3. Wait for the dashboard to render.
  4. Extract the data from the live DOM.

With a CDP-backed session, this is straightforward:

import asyncio
from playwright.async_api import async_playwright
import httpx

ANAKIN_API_KEY = "your_api_key"

async def scrape_dashboard():
    # Create a browser session via Anakin's API
    response = httpx.post(
        "https://api.anakin.ai/v1/browser-sessions",
        headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
        json={"session_ttl": 300}  # 5-minute session
    )
    session = response.json()
    ws_endpoint = session["wsEndpoint"]

    async with async_playwright() as p:
        # Connect to the existing hosted browser instance
        browser = await p.chromium.connect_over_cdp(ws_endpoint)
        context = browser.contexts[0]
        page = await context.new_page()

        # Step 1: navigate to login
        await page.goto("https://example-saas.com/login")
        await page.wait_for_selector("input[name='email']")

        # Step 2: fill and submit
        await page.fill("input[name='email']", "user@example.com")
        await page.fill("input[name='password']", "s3cur3pass")
        await page.click("button[type='submit']")

        # Step 3: wait for dashboard content to render
        await page.wait_for_selector(".dashboard-metrics", timeout=10000)

        # Step 4: extract data from the live DOM
        metrics = await page.eval_on_selector_all(
            ".metric-card",
            "cards => cards.map(c => ({ label: c.querySelector('.label').textContent, value: c.querySelector('.value').textContent }))"
        )

        print(metrics)
        # [{'label': 'Monthly Revenue', 'value': '$42,180'}, ...]

        await browser.close()

asyncio.run(scrape_dashboard())

A few things worth noting here. connect_over_cdp attaches to the remote browser rather than launching a new one. The browser.contexts[0] line reuses the existing context, which means any cookies or storage the session already has are available. And wait_for_selector is doing real work: it is polling the DOM until the React component finishes rendering, which a static HTTP request would never see.

Handling the Awkward Edge Cases

Stateful sessions introduce problems that stateless scraping does not have.

Session expiry mid-flow. Sites time out inactive sessions, sometimes in minutes. If your script pauses between steps (say, you are processing data from step 3 before continuing), the site may have logged you out. The fix is to check for the presence of a login redirect at each navigation step, not just at the start.
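
A minimal sketch of that check, assuming the site redirects expired sessions to a path containing /login; login() is a hypothetical helper that repeats the form fill from the example above:

async def goto_checked(page, url: str):
    # Navigate, then re-authenticate if the site bounced us to the login page.
    await page.goto(url)
    if "/login" in page.url:   # assumption: expired sessions redirect to a /login path
        await login(page)      # hypothetical helper: repeats the email/password form fill
        await page.goto(url)   # retry the original navigation with the fresh session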

2FA and CAPTCHAs. Some sites push a CAPTCHA after login, especially on new IP addresses. CDP sessions let you intercept the page at that point and either route to a CAPTCHA-solving service or pause and wait for a human to complete it, then resume programmatically. You keep the session alive while the human intervenes.
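
A sketch of that pause-and-resume flow, assuming the CAPTCHA renders in an iframe whose src contains "captcha" (adjust the locator to whatever the actual widget looks like):

async def wait_out_captcha(page, timeout_ms: int = 180_000):
    # Assumption: the CAPTCHA loads in an iframe we can locate by its src.
    captcha = page.locator("iframe[src*='captcha']")
    if await captcha.count() > 0:
        print("CAPTCHA detected, waiting for a human or solver to clear it...")
        # The CDP session stays alive the whole time; we just wait for the frame to detach.
        await captcha.first.wait_for(state="detached", timeout=timeout_ms)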

Memory and resource leaks. A long-running CDP session accumulates open pages, event listeners, and network logs. If you are running many sessions in parallel, close pages you are done with explicitly. Do not rely on garbage collection.
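
A simple guard: wrap each unit of work in try/finally so its page is closed even when a step raises.

async def scrape_one(context, url: str):
    page = await context.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        # Close the page as soon as this task is done, even on failure,
        # so a long-lived session does not accumulate dead tabs.
        await page.close()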

Flaky selectors. JS-heavy apps change their DOM structure on deploys. Prefer aria-label, data-testid, or visible text over deeply nested CSS selectors. They are less likely to break silently.
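
For example, the metric extraction from earlier could lean on test IDs instead of class names. The data-testid values here are assumptions about the markup, not something the example site guarantees:

async def extract_metrics_resilient(page):
    # Assumed markup: each card carries data-testid="metric-card",
    # with data-testid="label" and data-testid="value" inside it.
    cards = page.get_by_test_id("metric-card")
    results = []
    for i in range(await cards.count()):
        card = cards.nth(i)
        results.append({
            "label": await card.get_by_test_id("label").inner_text(),
            "value": await card.get_by_test_id("value").inner_text(),
        })
    return results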

One pattern that works well for multi-step flows: treat each logical step as a function that asserts it is on the right page before proceeding. If page.url does not match what you expect, log the actual URL, take a screenshot, then raise. This turns silent failures into loud ones.

async def assert_on_page(page, expected_path: str):
    current = page.url
    if expected_path not in current:
        await page.screenshot(path="debug_screenshot.png")
        raise RuntimeError(f"Expected path '{expected_path}', got '{current}'")
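
In practice, each step function calls it before doing any real work. A sketch, assuming the dashboard lives under a /dashboard path:

async def step_open_dashboard(page):
    await page.goto("https://example-saas.com/dashboard")
    await assert_on_page(page, "/dashboard")  # loud failure if we were bounced back to /login
    await page.wait_for_selector(".dashboard-metrics")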

When to Reach for This vs. Simpler Tools

CDP-based sessions are heavier than a plain HTTP scraper. They consume more memory, have higher latency, and require more careful error handling. Use them when:

  • The target is behind authentication and session cookies matter.
  • The data is rendered client-side by JavaScript.
  • The flow involves multiple user interactions (clicks, form fills, file uploads).
  • You need to observe network traffic or intercept requests mid-flow.

If the data is in the initial HTML response, a standard scrape API call is faster and cheaper. CDP is for the cases where the simpler tool genuinely cannot reach the data.

The interesting direction from here is combining stateful browser sessions with structured extraction. Once you have a live, authenticated, fully rendered page, you can pass the DOM to an LLM-backed extractor that reads the content semantically rather than with CSS selectors. That combination handles the "the DOM structure changes but the information is always there" problem, which is the last real reliability bottleneck in production scraping.
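
A rough sketch of where that handoff happens. extract_structured is a hypothetical LLM-backed extractor, not a real library call; page.content() is the real Playwright method that returns the serialized DOM after client-side rendering:

async def extract_semantically(page):
    # page.content() serializes the DOM *after* client-side rendering,
    # which is the payload a semantic extractor needs.
    html = await page.content()
    # Hypothetical extractor: stands in for whatever LLM-backed service you use.
    return await extract_structured(html, schema={"label": "string", "value": "string"})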
