How I Scrape 91 Websites Every Day Without Getting Blocked
Published on BirJob.com · March 2026 · by the solo developer behind BirJob
Introduction: Why 91 Sites Is Harder Than It Sounds
BirJob.com is Azerbaijan's job aggregator. The idea is simple: instead of checking Kapital Bank's careers page, then SOCAR's, then Azercell's, then twenty government portals, then a handful of international tech boards — you just check one place. One search box, all the jobs.
When I started, I figured scraping job listings would be the easy part. Parse some HTML, store it in a database, done. I was wrong in almost every direction.
Today BirJob pulls from 91 sources. That number isn't marketing — it's the literal count
of .py files in scraper/sources/. Every morning at 6 AM, 9 AM, and
1 PM UTC, GitHub Actions fires up a Docker container, runs all of them concurrently (with a
semaphore cap of 2 in CI), and writes whatever comes back into a Postgres database. The whole
thing typically finishes in 15–20 minutes.
The 91 number hides a lot of pain. Some of those scrapers have been rewritten three or four
times. A non-trivial number of them are in the DISABLED_SCRAPERS graveyard —
they exist, they compile, and they will never run again until someone fixes whatever broke them.
This article is about what actually happens when you try to aggregate job listings across a
country's entire internet at scale, as one person, with no budget.
The Architecture: One File, One Source
The rule is simple: each source gets its own file in scraper/sources/. There is
no monolith that tries to handle multiple sites. The file for Kapital Bank is
kapitalbank.py. The file for SOCAR Downstream is
socardownstream_az.py. Each file contains exactly one class that extends
BaseScraper and exactly one public async method named either
scrape_* or parse_*.
This pattern has served me well. When Azercell changes their API, I open
azercell.py and fix it. Nothing else is affected. When I add a new source, I copy
an existing file, change the URL and parsing logic, and the manager picks it up automatically.
No registration, no config file update, no import list to maintain.
BaseScraper
BaseScraper in scraper/base_scraper.py does three things:
- Loads database credentials from DATABASE_URL (or falls back to individual env vars for backward compatibility).
- Provides fetch_url_async, an async HTTP helper with User-Agent rotation, retry logic, exponential backoff, and encoding detection.
- Provides save_to_db, an upsert method that writes jobs to Postgres and handles deduplication, soft-deletes, and source tracking.
Every individual scraper inherits these. A minimal scraper looks like this:
from base_scraper import BaseScraper, scraper_error_handler
import pandas as pd

class KapitalbankScraper(BaseScraper):
    @scraper_error_handler
    async def parse_kapitalbank(self, session):
        url = "https://apihr.kapitalbank.az/api/Vacancy/vacancies?Skip=0&Take=150"
        response = await self.fetch_url_async(url, session)
        if response:
            data = response.get('data', [])
            jobs = []
            for job in data:
                job_id = job.get('id')
                jobs.append({
                    'company': 'Kapital Bank',
                    'vacancy': job['header'],
                    'apply_link': f"https://hr.kapitalbank.az/vacancies/{job_id}"
                })
            return pd.DataFrame(jobs)
        return pd.DataFrame(columns=['company', 'vacancy', 'apply_link'])
That is the entire Kapital Bank scraper. 20 lines of actual logic. The heavy lifting (retries, headers, database writes, error boundaries) is all in the base class.
ScraperManager
ScraperManager in scraper/scraper_manager.py dynamically loads every
file in sources/ that is not in the disabled list:
scraper_files = [
    f.stem for f in sources_dir.glob('*.py')
    if f.stem != '__init__' and f.stem not in self.DISABLED_SCRAPERS
]

for scraper_file in scraper_files:
    module = importlib.import_module(f'sources.{scraper_file}')
    for attr_name in dir(module):
        attr = getattr(module, attr_name)
        if (isinstance(attr, type) and
                issubclass(attr, BaseScraper) and
                attr != BaseScraper):
            self.scrapers[scraper_file] = attr
            break
It then runs all of them concurrently using an asyncio.Semaphore, collects
results, and calls save_to_db on the combined DataFrame. The whole orchestration
is async end to end.
The @scraper_error_handler decorator
Every public method on every scraper is decorated with @scraper_error_handler.
Here is the actual implementation:
def scraper_error_handler(func: Callable) -> Callable:
    @wraps(func)
    async def wrapper(self, *args, **kwargs):
        try:
            return await func(self, *args, **kwargs)
        except BaseException as e:
            error_msg = f"Error in {func.__name__}: {str(e)}"
            logger.error(error_msg)
            if os.getenv('GITHUB_ACTIONS') == 'true':
                empty_df = pd.DataFrame(columns=['company', 'vacancy', 'apply_link'])
                empty_df.attrs['scraper_error'] = {
                    'error_type': type(e).__name__,
                    'error_message': str(e),
                    'function_name': func.__name__
                }
                return empty_df
            return pd.DataFrame(columns=['company', 'vacancy', 'apply_link'])
    return wrapper
Notice it catches BaseException, not just Exception. That was
intentional after I hit a case where a scraper raised a KeyboardInterrupt-like
signal in a subprocess context and killed the entire run. Now nothing escapes.
When running on GitHub Actions, the decorator also attaches error metadata to the returned
DataFrame's attrs dict. The manager checks for this marker to distinguish
"scraper found zero jobs legitimately" from "scraper crashed with an exception". Both return
an empty DataFrame, but only one is a bug.
Four Strategies for Modern Websites
Over time I've settled into a mental model with four tiers, ordered by preference:
- BeautifulSoup on static HTML (fast, cheap, reliable when it works)
- JSON API discovery via DevTools (fast, cheap, much more stable than HTML scraping)
- __NEXT_DATA__ extraction (elegant hack for Next.js government portals)
- Playwright (last resort: slow, brittle, expensive in CI)
Strategy 1: BeautifulSoup on Static HTML
The majority of the 91 scrapers use this. Fetch a URL, parse the HTML, pull out title and link. Most Azerbaijani company career pages are server-rendered: they generate HTML on the backend and send it to the browser as-is. BeautifulSoup handles those easily.
The workflow is the same every time. Open DevTools in Chrome. Load the careers page. Right-click a job title and click "Inspect". Find the outermost container that repeats for each job. Identify what uniquely identifies it — a class name, a tag, an attribute pattern. Write the selector.
KPMG Azerbaijan is a clean example of this. Their careers page at
kpmg.com/az/en/home/careers/our-vacancies.html renders job listings as
<h6> elements containing anchor links. The scraper:
h6_elements = soup.find_all('h6')
for h6 in h6_elements:
    link = h6.find('a', href=True)
    if not link:
        continue
    title = h6.get_text(strip=True)
    href = link.get('href', '')
    if title and href:
        jobs.append({
            'company': 'KPMG Azerbaijan',
            'vacancy': title,
            'apply_link': href
        })
The problem: this structure can change without warning. KPMG's page used to be a table in a
div.bodytext-data container. Then it became an h6 list. The scraper
now has three fallback strategies stacked: try the h6 structure first, then the
cmp-text-list__item list, then the legacy table. This is common across many scrapers.
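The stacking itself is just a loop over parsers, ordered newest-layout-first. A minimal sketch of the pattern (the strategy names and per-layout parser functions here are illustrative, not the actual names in kpmg.py):

```python
def parse_with_fallbacks(html, strategies):
    """Try each (name, parser) pair in order; return the first non-empty result.
    A selector that no longer matches should fall through, not crash the run."""
    for name, parse in strategies:
        try:
            jobs = parse(html)
        except Exception:
            continue  # broken selector: move on to the next layout
        if jobs:
            return name, jobs
    return None, []
```

Wired up as something like `parse_with_fallbacks(html, [('h6', parse_h6), ('list', parse_cmp_list), ('table', parse_legacy_table)])`. Returning the winning strategy's name is useful in logs: the day 'table' starts winning again, you know the site changed.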
Strategy 2: JSON APIs Found via DevTools Network Tab
This is the best outcome. When you find that a site is actually calling a JSON API under the hood, you can bypass the HTML entirely. The data is cleaner, the parsing is trivial, and the endpoint tends to be stable even when the frontend redesigns every six months.
Kapital Bank is the cleanest example. Their HR portal calls a public API endpoint:
https://apihr.kapitalbank.az/api/Vacancy/vacancies?Skip=0&Take=150&SortField=id&OrderBy=true.
That URL returns a JSON object with a data array. Each element has id,
header, location, employmentType, and deadLine.
The scraper is 20 lines long and has never broken.
SOCAR Downstream is slightly more interesting. Their vacancies page makes an AJAX POST to
/lazy_load_vacancies/1 with a page number and CSRF token in the body. The
response is a JSON object where files_html is an array of HTML strings
— each string being the rendered HTML for one vacancy card. You have to parse the HTML
from the JSON. Weird, but once you know the pattern it is completely reliable.
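A stdlib-only sketch of that two-step parse. The files_html key is the real one from the SOCAR response; the fragment parsing here uses html.parser for self-containment, while the actual scraper can use BeautifulSoup:

```python
import json
from html.parser import HTMLParser

class CardParser(HTMLParser):
    """Collect (href, text) pairs for every anchor in one HTML fragment."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

def parse_lazy_load_response(raw: str):
    """files_html is an array of rendered vacancy cards; parse each one."""
    jobs = []
    for fragment in json.loads(raw).get('files_html', []):
        parser = CardParser()
        parser.feed(fragment)
        jobs += [{'vacancy': text, 'apply_link': href}
                 for href, text in parser.links if text]
    return jobs
```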
Azercell was a bug that turned into a lesson. Their careers site at
azercell.easyhire.me is paginated, and the URL to get JSON data is:
https://azercell.easyhire.me/job/search?json=true&page=0
The key parameter is ?json=true, not ?json. When I first wrote
the scraper I used ?json (without the value), and the server just returned HTML.
I spent an embarrassingly long time trying to parse the HTML before I caught the difference
in the network tab. The ?json=true form returns a clean JSON object with a
jobs array and a count total for pagination.
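A cheap guard makes this class of bug loud instead of silent: before parsing, check whether the body is actually JSON and treat HTML as a failure. This is a hypothetical helper, not code from the repo:

```python
import json

def parse_maybe_json(text: str):
    """Parse a response body as JSON; return None if the server sent HTML.
    Returning None forces the caller to log a failure instead of silently
    trying to parse the wrong format."""
    stripped = text.lstrip()
    if stripped.startswith('<'):
        return None  # HTML came back: wrong query param, redirect, or challenge page
    try:
        return json.loads(stripped)
    except json.JSONDecodeError:
        return None
```

With this in place, the ?json vs ?json=true mistake would have shown up as an explicit "expected JSON, got HTML" failure on the first run.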
When hunting for these hidden APIs, I look for a few patterns in the network tab:
- XHR or Fetch requests to /api/, /v1/, /v2/
- Requests with Accept: application/json headers
- POST requests to /graphql (which is its own mess, covered below)
- URL parameters like ?format=json, ?json=1, ?output=json
- Requests to third-party ATS providers: Huntflow, EasyHire, Oracle HCM, SAP SuccessFactors
Strategy 3: __NEXT_DATA__ Extraction
This is my favourite approach for government portals built on Next.js. Every Next.js page
embeds a <script id="__NEXT_DATA__" type="application/json"> tag in
the HTML containing the full server-side props that were used to render the page. This
includes the actual data — in the case of a vacancies page, the full list of jobs.
TABIB (the state healthcare management union) runs a Next.js site. The vacancies page
at tabib.gov.az/vetendashlar-ucun/vakansiyalar previously had CSS class
names like vacanycard__title__3fH9k — hashed identifiers generated by
CSS Modules that change every time the site is rebuilt. My original scraper broke after
their first post-launch deployment. I fixed it by ignoring the CSS entirely:
soup = BeautifulSoup(response, 'html.parser')
script_tag = soup.find('script', id='__NEXT_DATA__')
if script_tag:
    next_data = json.loads(script_tag.string)
    vacancies = (
        next_data.get('props', {})
        .get('pageProps', {})
        .get('vacancies', [])
    )
    for v in vacancies:
        title = v.get('name') or v.get('title') or v.get('positionName') or 'N/A'
        slug = v.get('slug') or v.get('url') or v.get('id', '')
        apply_link = urljoin(base_url, f"/vetendashlar-ucun/vakansiyalar/{slug}")
        jobs.append({'company': 'TABIB', 'vacancy': title, 'apply_link': apply_link})
The CSS classes can change on every deployment. __NEXT_DATA__ almost never
changes structure unless the developer explicitly changes their data model. It is a much
more stable anchor than class names.
The pattern works on any Next.js site. You fetch the page HTML, find the script tag by
its id, parse the JSON, and navigate the props.pageProps tree to find your data.
The exact path varies by site, but the outer container is always the same.
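A minimal stdlib sketch of the generic version. A regex is enough to show the shape (a production version should locate the tag with a real HTML parser), and dig is a hypothetical helper for walking the props tree safely:

```python
import json
import re

# Pull the __NEXT_DATA__ payload out of raw page HTML.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)

def extract_next_data(html: str) -> dict:
    m = NEXT_DATA_RE.search(html)
    return json.loads(m.group(1)) if m else {}

def dig(data, *path, default=None):
    """Walk a props.pageProps-style nested dict without raising."""
    for key in path:
        if not isinstance(data, dict):
            return default
        data = data.get(key, default)
    return data
```

Usage looks like `dig(extract_next_data(html), 'props', 'pageProps', 'vacancies', default=[])`; only the final path segment changes from site to site.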
Strategy 4: Playwright (The Nuclear Option)
When nothing else works, I reach for Playwright. It launches a real Chromium browser, navigates to the page, waits for JavaScript to execute, and then lets you extract content from the fully-rendered DOM.
The tradeoffs are severe:
- A Playwright scrape takes 20–60 seconds per site; a fetch_url_async call takes 1–3 seconds.
- Playwright requires a full Chromium install inside the Docker container, adding ~300MB to the image.
- It fails in ways that are hard to diagnose: browser launch timeout, selector not found, page navigation error, out-of-memory in a container.
- Sites that block scrapers based on browser fingerprinting are even better at detecting headless Chromium than they are at detecting aiohttp.
Currently only two scrapers use Playwright in production: busy.py (the
busy.az job board, which is a Next.js SPA with no server-rendered HTML) and
the Playwright fallback path in mckinsey.py (which rarely succeeds and is
effectively disabled).
The busy.az Playwright scraper looks like this in simplified form:
async with async_playwright() as p:
    browser = await p.chromium.launch(
        headless=True,
        args=['--no-sandbox', '--disable-setuid-sandbox']
    )
    context = await browser.new_context(
        viewport={'width': 1920, 'height': 1080},
        user_agent='Mozilla/5.0 ...'
    )
    page = await context.new_page()
    for page_num in range(1, 4):
        url = f"https://busy.az/vacancies?page={page_num}"
        await page.goto(url, wait_until='domcontentloaded', timeout=45000)
        try:
            await page.wait_for_selector('a[href*="/vacancies/"]', timeout=20000)
        except PlaywrightTimeout:
            await asyncio.sleep(3)
        content = await page.content()
        # ... parse with BeautifulSoup
I use --no-sandbox because the container runs as root in GitHub Actions and
Chromium refuses to start otherwise. The wait_for_selector call with a timeout
fallback handles the case where the SPA loads but the jobs section takes longer than expected.
The Problems Nobody Warns You About
CSS Class Hashes That Change Every Deploy
Modern frontend tooling — Next.js, Create React App, Vite — generates CSS Modules
by default. Class names like .VacancyCard_title__3fH9k contain a hash of the
source file or its contents. When a developer rebuilds the site after any CSS change, the
hash changes. Your selector breaks.
The TABIB situation was the clearest example. My first version of that scraper used:
vacancy_cards = soup.find_all('div', class_=lambda c: c and 'vacanycard' in ' '.join(c).lower())
That worked for about two weeks. After TABIB pushed a new release, the class names were
completely different. The __NEXT_DATA__ rewrite solved it permanently.
The general defense is to never rely on a class name that looks like it contains a hash.
Prefer tag-based selectors, data-* attributes, structural patterns (nth-child,
parent-child relationships), or embedded JSON data. If you absolutely must use a class name,
use substring matching rather than exact matching so minor changes don't break you.
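One way to enforce this before a selector is ever written is a heuristic that flags hash-like class names. This sketch is my own rule of thumb, not code from the repo:

```python
def is_fragile_class(name: str) -> bool:
    """Flag CSS-Modules-style class names like 'VacancyCard_title__3fH9k'.
    Heuristic: a short alphanumeric segment containing a digit after '__'.
    Plain BEM names like 'vacancies__item' pass, because their suffix has
    no digits."""
    segment = name.rsplit('__', 1)[-1]
    return (
        segment != name                          # there was a '__' separator
        and 5 <= len(segment) <= 10
        and segment.isalnum()
        and any(ch.isdigit() for ch in segment)  # hashes mix letters and digits
    )
```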
GitHub Actions IPs Getting Blocked
GitHub Actions runners come from a well-known set of IP ranges that Microsoft/GitHub publishes publicly. Bot-detection systems — Cloudflare, Akamai, custom IP blocklists — know these ranges and block them.
Djinni.co is the most frustrating case. Locally, the djinni scraper works perfectly. It
fetches 30+ pages, returns hundreds of tech jobs from Ukrainian and Eastern European companies.
On GitHub Actions, it gets Errno 104 - Connection Reset by Peer on the first
request. Not a 403. Not a timeout. A TCP-level connection reset, which means the server
doesn't even want to complete the handshake.
I spent several hours trying to work around this. Rotating User-Agents didn't help —
the block is at the IP level. Adding realistic browser headers didn't help. Even adding
random delays between requests didn't help. The server simply refuses connections from
GitHub's IP space. Djinni is now in DISABLED_SCRAPERS.
hrcbaku.az has the same problem. Connection reset after ~464 seconds of trying,
which means GitHub Actions was burning almost 8 minutes per run on that scraper alone before
I disabled it.
The lesson: if a scraper works locally but consistently fails in CI with connection errors (not HTTP errors), it's almost certainly an IP block. There is no clean solution except using a residential proxy or running the scraper from your own server with a static IP.
Sites Changing Their HTML Structure Without Warning
I have no monitoring for "is this scraper's HTML selector still correct." The scrapers run, they either return data or they return zero rows, and I find out in the Telegram notification summary after the run.
The KPMG history is the clearest example of this. The site was originally a table:
<div class="bodytext-data">
  <table>
    <tbody>
      <tr>
        <td>Senior Associate</td>
        <td><a href="https://kpmgcca.global.huntflow.io/apply/...">Apply</a></td>
      </tr>
    </tbody>
  </table>
</div>
At some point KPMG redesigned their careers section. The table disappeared. Jobs were
now listed as <h6> elements with anchor links pointing to their Huntflow
ATS instead of the old PDF applications. The scraper returned zero for several days before
I noticed.
My current approach is to look at the zero-result list after each run and investigate anything that used to return jobs. If a scraper returns zero for more than three consecutive days, something is probably wrong. I haven't automated this check yet.
Cloudflare Challenges
Cloudflare's bot challenge works by serving a JavaScript puzzle instead of the actual page
content. Your scraper receives a 200 response containing the Cloudflare challenge HTML,
not the page you wanted. fetch_url_async considers this a success because the
status code was 200, and your scraper silently produces zero results because BeautifulSoup
can't find any job listings in a Cloudflare challenge page.
workly.az is currently blocked by Cloudflare on GitHub Actions. The site works
fine locally. In CI, every request gets the challenge page. There is no way around this
without either a real browser (Playwright) or a residential proxy. I disabled the scraper
rather than let it consume CI time returning zero.
The signature of a Cloudflare block in logs is: scraper completes in under 2 seconds,
returns zero jobs, no errors. Too fast and too clean. If you check the raw HTML response,
you will see something like <title>Just a moment...</title> and
references to challenge-platform.
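That signature is easy to check for mechanically. A small detector; the marker strings are taken from the challenge pages I've described plus common Cloudflare prefixes, and they are assumptions that may change as Cloudflare evolves:

```python
# Marker strings seen on Cloudflare challenge pages (assumptions, not a
# guaranteed-complete list).
CHALLENGE_MARKERS = ('just a moment', 'challenge-platform', 'cf-chl')

def looks_like_cloudflare_challenge(html: str) -> bool:
    """True if a 200 response body is a challenge page, not real content."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

Running this on the raw body of any suspiciously fast, zero-result scrape turns "silently returned nothing" into an explicit "blocked by Cloudflare" log line.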
GraphQL SPAs With No Documented API
boss.az is Azerbaijan's largest job board. It's built on Next.js with Apollo
GraphQL on the backend. Every page you see is rendered client-side, fed by GraphQL queries.
There is no server-rendered HTML with job listings.
I wrote a GraphQL scraper that tries three different query shapes against
https://boss.az/graphql:
query_attempts = [
    {
        "query": """
            query GetVacancies($page: Int, $perPage: Int) {
                vacancies(page: $page, perPage: $perPage) {
                    list {
                        id name positionName
                        profile { name logoUrl }
                    }
                    total
                }
            }
        """,
        "variables": {"page": 1, "perPage": 50}
    },
    # ... two more guesses
]
The endpoint responds, but none of the guessed schemas match the actual API schema. GraphQL returns structured error messages when the schema is wrong, so you know immediately that your query is invalid — but you don't know what the correct schema is without introspection, and Boss.az has introspection disabled. The scraper always returns zero and is now disabled.
The correct approach here would be to intercept the actual GraphQL queries the browser sends (via a proxy or Playwright's network interception API), capture the exact query format, and replay it. I haven't done this yet.
The DISABLED_SCRAPERS Graveyard
The ScraperManager maintains a set called DISABLED_SCRAPERS
(loaded from the database, with a hardcoded fallback):
DISABLED_SCRAPERS_FALLBACK = {
    "projobs_vacancies", "boss_az", "workly_az", "bfb", "djinni",
    "guavapay", "mckinsey", "its_gov", "isbu_az", "bp",
    "tabib_vacancies", "hrcbaku",
}
Each of these scrapers has a story.
projobs_vacancies — Dead API
ProJobs is an Azerbaijani job board. The original scraper targeted their documented REST API
at core.projobs.az/v1/vacancies. At some point that endpoint simply started
returning 404. The API version was probably deprecated when they rewrote their backend.
I wrote a new version that tries several API candidate URLs, but none of them work. Disabled
pending someone figuring out their current API.
boss_az — GraphQL Schema Unknown
Covered above. Three query attempts, three failures, zero jobs, always. The scraper code exists and is technically functional — it just cannot discover the correct schema without introspection access or source code.
bp — Pure Algolia/JavaScript
BP's careers page for Azerbaijan is a React application that loads job listings from Algolia search on the client side. The server sends HTML that contains empty job-container divs and a bundle of JavaScript. The JavaScript makes authenticated Algolia API calls using a short-lived key embedded in the bundle.
I found the Algolia application ID and search key (RF87OIMXXP,
55a63aab6a8a8b6be5266a69f9275540) by reading the page source. But Algolia
search keys are scoped — this key is for the main bp.com site index,
not the careers index. Searching it for Azerbaijan jobs returns zero results because the
careers data lives in a different index. The key was also likely rotated since I captured it.
The honest summary in the scraper's logs:
- Page loads successfully (200 OK)
- Algolia search system detected
- 19 UI containers found for dynamic job loading
- No jobs found in static HTML (expected for dynamic sites)
BP genuinely may not have open positions in Azerbaijan most of the time. But even when they do, we can't get at them without either Playwright or a reverse-engineered Algolia key.
mckinsey — Playwright Timeout at 140 Seconds
The McKinsey scraper tries their documented API at
mckapi.mckinsey.com/api/jobsearch?cities=Baku first. When that fails (which it
usually does in CI because the endpoint requires specific origin headers), it falls back to
Playwright.
The Playwright fallback navigates to
mckinsey.com/careers/search-jobs?cities=Baku, waits for
li.job-listing selectors to appear, and then extracts the content. On a real
browser on a fast connection, this takes 5–10 seconds. On GitHub Actions in a Docker
container, the navigation timeout is set to 90 seconds. The selector wait is 45 seconds.
That's 135 seconds of potential waiting before a single scraper gives up.
Even worse, McKinsey's site uses aggressive bot detection. The Playwright-rendered page often gets a consent wall or a bot challenge that prevents the job listings from ever loading. The scraper times out at ~140 seconds having collected nothing. Disabled.
djinni — GitHub Actions IPs Blocked
Already covered. Works locally, gets connection-reset in CI. Djinni is a popular platform for Eastern European tech jobs and would be valuable to include. I'm watching for a solution that doesn't involve paying for residential proxies.
workly_az — Cloudflare Blocks
Also covered. The scraper code is clean and correct. The site simply refuses to serve content to GitHub Actions IPs.
bfb — Genuinely No Listings
BFB's careers page has no job listings. It is a CV submission form. There is nothing to scrape. The file exists because someone asked me to add BFB as a source before I discovered this. Disabled, not broken.
isbu_az and its_gov — Structural Changes and Timeouts
isbu.az changed their CSS class structure at some point. The scraper's
primary selector (a.vacancies__item) no longer matches anything. There is a
fallback that tries to find any anchor with /vakansiya/ in the href, but the
site now has enough client-side rendering that this also returns nothing from the static HTML.
its.gov.az (the government IT agency) times out at ~192 seconds. The server
accepts the connection but never sends a complete response. This is either extremely slow
server-side rendering or the server is down but keeps connections open. Either way, it burns
3 minutes of CI time per run.
Making It Reliable
@scraper_error_handler — Blast Radius Containment
The fundamental guarantee that the system provides: one scraper crashing cannot affect
any other scraper. Without this, a single unhandled exception in the KPMG scraper would
propagate up through asyncio.gather and potentially terminate the entire run.
The decorator provides this guarantee by catching everything at the method level, before
the exception has a chance to reach the caller. The asyncio.gather in
ScraperManager.run_all_scrapers uses return_exceptions=True as
a second line of defense, but the decorator should make that unnecessary for well-formed
scrapers.
The BaseException catch is deliberate. Catching Exception does not cover SystemExit, KeyboardInterrupt, or GeneratorExit, which inherit directly from BaseException. In an async context with Playwright or subprocess calls, these can surface in unexpected places. BaseException catches everything.
fetch_url_async — The HTTP Layer
fetch_url_async in BaseScraper wraps aiohttp with
several layers of defence:
User-Agent rotation. There is a pool of 10 realistic User-Agent strings covering Chrome on Windows, Chrome on Mac, Chrome on Linux, Firefox on Windows, Firefox on Mac, Safari on Mac, and Edge on Windows. Each request picks one at random. This is not sophisticated bot evasion, but it prevents the most naive User-Agent-based filters.
Exponential backoff for rate limiting. On 403, 429, 503, 502, and 504 responses, the function waits and retries. In GitHub Actions mode:
wait_time = min((3 ** attempt) + random.uniform(2, 5), 20)
In local mode it uses a gentler 2 ** attempt base. Up to 3 retry attempts
are made before giving up.
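Extracted as a pure function, the delay schedule looks like this. The CI formula is the one shown above; the local variant's jitter range is my assumption:

```python
import random

def backoff_delay(attempt: int, in_ci: bool = True) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if in_ci:
        # Aggressive base, random jitter, hard 20-second cap.
        return min((3 ** attempt) + random.uniform(2, 5), 20)
    # Local runs use a gentler 2**attempt base; this jitter is an assumption.
    return (2 ** attempt) + random.uniform(0.5, 1.5)
```

In CI the schedule works out to roughly 3–6s, 5–8s, 11–14s, then a flat 20s cap, which keeps even the worst retry sequence under a minute per URL.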
Content-type aware decoding. If the response Content-Type is
application/json, the function calls response.json() and returns
a dict. Otherwise, it reads the bytes and decodes them, trying UTF-8 first and falling
back to chardet for automatic encoding detection. This matters for Azerbaijani
sites that sometimes serve content in Windows-1252 or ISO-8859-1.
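A stdlib-only sketch of that fallback chain. The real helper uses chardet for detection; here the candidate encodings are hardcoded, which is a simplification:

```python
def decode_response(raw: bytes) -> str:
    """Decode a response body, trying UTF-8 first, then legacy encodings
    some Azerbaijani sites still serve. Note iso-8859-1 maps every byte,
    so it acts as the terminal fallback."""
    for encoding in ('utf-8', 'windows-1252', 'iso-8859-1'):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode('utf-8', errors='replace')  # unreachable in practice
```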
Configurable timeouts. 60 seconds in CI, 45 seconds locally. Connect timeout of 20 seconds in both cases. These numbers came from experience: some government portals are genuinely slow and need more than the default 30-second total timeout.
Concurrency Management
Running 91 scrapers fully concurrently would exhaust the connection pool and likely get
several sites rate-limited simultaneously. The semaphore in run_all_scrapers
limits active scrapers:
if self.is_github_actions:
    max_concurrent = min(max_concurrent, 2)  # Reduce to 2 in CI

semaphore = asyncio.Semaphore(max_concurrent)

async def run_with_semaphore(scraper_name):
    async with semaphore:
        if self.is_github_actions:
            await asyncio.sleep(random.uniform(3, 8))
        else:
            await asyncio.sleep(random.uniform(0.5, 2))
        return await self.run_single_scraper(scraper_name, session)
The concurrency of 2 in CI was reached empirically. Higher values caused cascading failures: several scrapers would time out simultaneously, fill the connection pool, and trigger further timeouts. 2 is conservative but has been stable.
The random sleep before each scraper starts (3–8 seconds in CI) staggers the initial burst. Without it, all scrapers start simultaneously and hit their target sites at the same moment, which looks much more like a bot than staggered requests.
The aiohttp.TCPConnector is configured with a global limit of 100 connections
and 15 per host in CI:
connector = aiohttp.TCPConnector(
    limit=100,
    limit_per_host=15,
    ttl_dns_cache=300,
    use_dns_cache=True,
    keepalive_timeout=60,
    enable_cleanup_closed=True
)
DNS caching (300-second TTL) reduces lookup overhead on sites with multiple subdomains.
enable_cleanup_closed prevents file descriptor leaks over the course of a long
run.
The Upsert Pattern
The database write is not a simple insert. Each run performs a full upsert using
apply_link as the stable identifier:
INSERT INTO scraper.jobs_jobpost
    (title, company, apply_link, source, last_seen_at, is_active)
VALUES %s
ON CONFLICT (apply_link) DO UPDATE SET
    title = EXCLUDED.title,
    company = EXCLUDED.company,
    source = EXCLUDED.source,
    last_seen_at = NOW(),
    is_active = TRUE
RETURNING (xmax = 0) AS is_new
The (xmax = 0) AS is_new trick is a Postgres-specific way to determine
whether the ON CONFLICT branch ran or the INSERT branch ran.
xmax is 0 for newly inserted rows; for updated rows it contains the transaction
ID of the updating transaction. This gives me accurate new/updated counts without a
separate query.
After upserting, a second query soft-deletes jobs from this source that were not seen in the current run:
UPDATE scraper.jobs_jobpost
SET is_active = FALSE
WHERE source = %s
AND apply_link NOT IN %s
This means a job that was live yesterday but missing today gets flagged as inactive rather
than deleted. Useful for distinguishing "company filled the position" from "scraper had a
hiccup." The frontend only shows is_active = TRUE jobs.
Before writing to the database, the scraper also deduplicates by normalized title within
each run. The normalization strips parenthetical content
('Frontend Developer (React)' becomes 'frontend developer'),
removes location suffixes after a dash
('Senior Engineer - Bakı' becomes 'senior engineer'), and
collapses whitespace. This catches the case where the same job is listed twice on the same
site with slightly different formatting.
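The normalization can be expressed in three passes. A sketch that reproduces the examples above; the repo's exact rules may differ:

```python
import re

def normalize_title(title: str) -> str:
    """Normalize a vacancy title for within-run deduplication."""
    t = re.sub(r'\([^)]*\)', '', title)   # strip parenthetical content
    t = re.split(r'\s[-–—]\s', t)[0]      # drop location suffix after a dash
    return re.sub(r'\s+', ' ', t).strip().lower()
```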
GitHub Actions as Free Infrastructure
The entire scraping pipeline runs on GitHub Actions. There is no server, no cron job on
a VPS, no Celery worker. The workflow file at
.github/workflows/scraper.yml schedules three runs per day:
on:
  schedule:
    - cron: '0 6,9,13 * * *'
That is 06:00, 09:00, and 13:00 UTC, which corresponds to 10:00, 13:00, and 17:00 in Azerbaijan (UTC+4). Three times daily is frequent enough to catch same-day job postings without hammering the target sites.
The workflow checks out the repository, builds a Docker image from
scraper/Dockerfile, and runs it with the database URL and Telegram credentials
injected as environment variables from GitHub Secrets.
The Docker build is the slowest step. Playwright requires a full Chromium install.
Installing Python packages from requirements.txt takes time. Without caching,
the build step alone took 6–8 minutes per run.
The solution is Docker layer caching via GitHub Actions cache:
- name: Cache Docker layers
  uses: actions/cache@v4
  with:
    path: /tmp/.buildx-cache
    key: ${{ runner.os }}-buildx-${{ hashFiles('scraper/requirements.txt') }}
    restore-keys: |
      ${{ runner.os }}-buildx-
The cache key is based on the hash of requirements.txt. When requirements
don't change (which is most runs), the full cached image layers are restored and the
build takes about 45 seconds. Only when I add or update a dependency does it rebuild from
scratch.
The workflow also uses the --cache-to / --cache-from flags with
mode=max:
docker buildx build \
  --cache-from=type=local,src=/tmp/.buildx-cache \
  --cache-to=type=local,dest=/tmp/.buildx-cache-new,mode=max \
  --load \
  -t birjob-scraper:latest \
  .
The mode=max caches all intermediate layers, not just the final image. This is
important because the Playwright install layer and the pip install layer are
in the middle of the Dockerfile, and without mode=max they would not be cached
individually.
The math on GitHub Actions minutes: three runs per day at ~20 minutes each is about 60 minutes per day, roughly 1,800 minutes per month. For a public repository, standard GitHub-hosted runners are free, so this costs nothing. A private repository's free tier is 2,000 minutes per month, which 1,800 would nearly exhaust; a couple of days of slow Playwright scrapers or failed retries would blow through the quota. I keep the repository public partly for this reason.
Monitoring: Telegram After Every Run
After each run completes, a Telegram message goes to a private channel with a summary of what happened. The message includes:
- Total scrapers run vs. total successful
- Number of new jobs added to the database
- Number of jobs updated (seen again)
- Number of jobs deactivated (not seen in this run)
- Total jobs currently in the database
- List of scrapers that returned zero results
- List of scrapers that errored, with error types
- Total run duration
This is the primary way I notice when something breaks. If I see that
kapitalbank returned zero for the first time in months, I know to go check
their API. If I see a scraper listed under "errored" with an AttributeError,
the site probably changed structure.
The GitHub Actions workflow also emits structured annotations using the
::error title=...::message syntax. These appear inline in the Actions UI
with red warning icons next to the specific step. For critical failures this is useful,
but for routine "this scraper returned zero" events the Telegram notification is easier
to scan.
What I don't have: alerting when a previously-healthy scraper starts consistently returning zero. Right now I notice this manually after a few days. A proper monitoring setup would track historical counts per scraper and alert when a scraper that typically returns 50 jobs starts returning 0 for multiple consecutive runs. That's on the backlog.
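The core of that check is small enough to sketch now. Given a per-scraper list of daily job counts, alert when a scraper that used to return jobs flatlines. This is a hypothetical helper, not yet in the repo:

```python
def needs_attention(history, zero_streak=3):
    """history: daily job counts for one scraper, oldest first.
    True when a previously-healthy scraper has returned zero for the
    last `zero_streak` days."""
    if len(history) <= zero_streak:
        return False
    was_healthy = any(count > 0 for count in history[:-zero_streak])
    now_silent = all(count == 0 for count in history[-zero_streak:])
    return was_healthy and now_silent
```

The was_healthy condition matters: a scraper that has always returned zero (like bfb) should never alert, only one that degraded.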
What I'd Do Differently
After a year of running this, here is what I would change if I started over:
Use residential proxies from day one for the high-value blocked sites. Djinni, boss.az, and a few others are blocked at the IP level in GitHub Actions. I've been ignoring this problem because the proxy cost felt unjustifiable for a free service. But djinni alone would add hundreds of tech job listings that are currently invisible on BirJob. The ROI is probably positive if even a fraction of those users convert to paying customers.
Store raw HTML alongside structured data. Right now, when a scraper breaks because a site changed structure, I have no way to look at what the page looked like when it was last working. If I stored the raw response (even just for a few days), debugging structure changes would be much faster.
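A possible shape for that, assuming a local directory of gzipped snapshots with a short TTL (the directory name and retention period are arbitrary):

```python
import gzip
import time
from pathlib import Path

RAW_DIR = Path("raw_snapshots")   # hypothetical location
TTL_SECONDS = 7 * 24 * 3600       # keep snapshots for a week

def save_snapshot(scraper: str, body: str) -> Path:
    """Gzip the raw response so the last-working page is inspectable later."""
    RAW_DIR.mkdir(exist_ok=True)
    path = RAW_DIR / f"{scraper}-{int(time.time())}.html.gz"
    path.write_bytes(gzip.compress(body.encode("utf-8")))
    return path

def prune_snapshots(now=None) -> int:
    """Delete snapshots older than the TTL; return how many were removed."""
    now = now or time.time()
    removed = 0
    for f in RAW_DIR.glob("*.html.gz"):
        # Filename stem is "<scraper>-<unix_ts>.html"; take the timestamp.
        ts = int(f.stem.split("-")[-1].removesuffix(".html"))
        if now - ts > TTL_SECONDS:
            f.unlink()
            removed += 1
    return removed
```

Gzipped HTML is small; a week of snapshots for 91 sources is tens of megabytes, not a storage problem.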
Write integration tests that run against real URLs on a weekly schedule. The kind of tests that don't mock anything: actually fetch the page, actually parse it, verify that the result is a non-empty DataFrame with plausible data. This would catch structural changes within days instead of weeks.
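The live tests would share one plausibility check. The real pipeline works with DataFrames; this sketch uses plain dicts to stay dependency-free, and the required field names are assumptions:

```python
def assert_plausible(rows: list, min_rows: int = 1) -> None:
    """Shared check for live integration tests: non-empty result with
    the fields the pipeline expects and no blank titles."""
    required = {"title", "company", "apply_link"}   # assumed schema
    assert len(rows) >= min_rows, "scraper returned no rows"
    for row in rows:
        missing = required - row.keys()
        assert not missing, f"missing fields: {missing}"
        assert row["title"].strip(), "blank title suggests a broken selector"

# A weekly, unmocked test would then look something like (hypothetical):
# @pytest.mark.liveweb
# async def test_kapitalbank_live():
#     rows = await KapitalBankScraper().scrape_jobs()
#     assert_plausible(rows, min_rows=5)
```

Marking these tests (`liveweb` here) keeps them out of the normal CI run, where network flakiness would be noise.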
Build a proper health dashboard. The Telegram summary is useful but linear. What I want is a table showing, per scraper, the last 30 days of job counts. A scraper that was returning 40 jobs and now returns 5 is showing a warning sign even if it hasn't broken yet. This would help me catch degraded-but-not-zero scrapers, which are currently invisible.
Never use CSS class names that look like they contain hashes. I knew this was fragile from the start but built a few scrapers that way anyway because it was faster. Every one of them has broken at least once. Structural selectors (nth-child, known tags, data attributes) are more work upfront but far more stable.
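This can even be linted for. A rough heuristic, entirely my own and with false positives, that flags class names containing a random-looking letter-digit token:

```python
import re

def looks_generated(class_name: str) -> bool:
    """Heuristic: CSS-in-JS bundlers emit class names containing a
    random-looking token ("jobCard__x7f3ab", "css-1q2w3e") that changes
    on the next deploy, so selectors built on them are fragile."""
    for token in re.split(r"[_\-]+", class_name):
        if len(token) >= 5 and re.search(r"\d", token) and re.search(r"[a-z]", token, re.I):
            return True
    return False

# Prefer structural selectors that survive restyling, e.g. (illustrative):
#   "ul.vacancies > li a[href*='/vacancy/']"
```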
Consider GraphQL introspection early. When I encounter a GraphQL site now, I try to run an introspection query first to learn the schema. Many GraphQL endpoints have introspection disabled in production but enabled in staging, or have it disabled but ship the schema definition in their JavaScript bundle. Ten minutes of reading bundle output would have saved the hours I spent guessing boss.az query shapes.
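The ten-minute check looks like this: POST an abridged version of the standard introspection query and see whether the endpoint describes itself. The endpoint URL is whatever the site's network tab shows:

```python
import json
import urllib.request

# Abridged introspection query: asks the endpoint to describe its own
# query type and the fields (with arguments) of every type.
INTROSPECTION_QUERY = """
query {
  __schema {
    queryType { name }
    types {
      name
      fields { name args { name type { name } } }
    }
  }
}
"""

def introspect(endpoint: str) -> dict:
    """POST the introspection query and return the decoded JSON.

    If introspection is disabled, the response typically contains an
    "errors" key instead of "data" -- that itself is useful signal.
    """
    req = urllib.request.Request(
        endpoint,
        data=json.dumps({"query": INTROSPECTION_QUERY}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=15) as resp:
        return json.load(resp)
```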
Use the __NEXT_DATA__ approach by default for any Next.js site.
I still sometimes try CSS-based selectors first on Next.js sites out of habit. I should
flip this: always check for __NEXT_DATA__ first, and only fall back to
CSS selectors if there is no useful data in the JSON. The JSON path is always more stable.
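Checking for `__NEXT_DATA__` is a few lines: Next.js server-renders the page state into a `<script id="__NEXT_DATA__" type="application/json">` tag. A minimal extractor (the regex tolerates extra attributes on the tag; the path to the listings inside the JSON varies per site):

```python
import json
import re

def extract_next_data(html: str) -> dict:
    """Pull the JSON blob that Next.js embeds in every server-rendered page."""
    m = re.search(
        r'<script[^>]+id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html,
        re.S,
    )
    if not m:
        raise ValueError("no __NEXT_DATA__ script tag; not a Next.js page?")
    return json.loads(m.group(1))

# Listings usually live somewhere under props.pageProps, e.g.
# data["props"]["pageProps"]["vacancies"] -- the exact key is per-site.
```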
Rate-limit per domain, not globally. The current semaphore caps total concurrent scrapers at 2, which does guarantee no single site ever sees more than 2 simultaneous requests. But the politeness concern is per-site, and the cap is global: only two scrapers can run at once even when they target completely different domains, so one slow scraper blocks a slot that a fast scraper for an unrelated site could be using. A per-domain rate limiter would let the fast scrapers finish quickly while the slow ones churn in the background.
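The replacement is small: keep one `asyncio.Semaphore` per domain instead of one global one. A sketch of what that could look like (the class and its defaults are my own, not BirJob's current code):

```python
import asyncio
from urllib.parse import urlparse

class PerDomainLimiter:
    """One semaphore per domain: different sites scrape in parallel,
    while any single site sees at most `per_domain` concurrent requests."""

    def __init__(self, per_domain: int = 2):
        self.per_domain = per_domain
        self._sems: dict[str, asyncio.Semaphore] = {}

    def for_url(self, url: str) -> asyncio.Semaphore:
        """Return the (lazily created) semaphore for this URL's domain."""
        domain = urlparse(url).netloc
        if domain not in self._sems:
            self._sems[domain] = asyncio.Semaphore(self.per_domain)
        return self._sems[domain]

# Usage inside a scraper (sketch):
#   async with limiter.for_url(url):
#       resp = await session.get(url)
```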
Conclusion
At 91 sources, the system works well enough. The morning run starts at 06:00 UTC and most users see fresh data by the time they sit down at their desks in Baku. Companies like Kapital Bank, SOCAR, Azercell, Azerconnect, and dozens of others are checked multiple times per day. New jobs typically appear on BirJob within a few hours of being posted.
The architecture is simple: one file per source, a shared base class, a decorator that contains failures, and a manager that orchestrates everything. The complexity lives in the individual scrapers, not in the framework. Adding a new source is genuinely 20–30 minutes of work. Fixing a broken scraper is usually faster than that.
What makes this kind of project interesting to maintain is that the internet doesn't sit still.
Sites redesign, APIs change, bot detection improves, domains go offline, companies switch ATS
providers. The DISABLED_SCRAPERS graveyard is a record of all the ways the open
web fights back. Most of those battles are losing ones — if a site with Cloudflare
protection doesn't want to be scraped from a cloud IP, there isn't much I can do without paying
for infrastructure.
But the majority of Azerbaijan's job market is accessible: company career pages, government portals, regional job boards, and local startups. Most of them don't have aggressive bot protection because they don't need to. They want their listings to be visible. BirJob is in the business of helping people find those listings. For that job, 91 scrapers running three times a day is good enough.
BirJob.com is Azerbaijan's job aggregator. If you're a company that wants your listings included, or if you've noticed your site is missing, feel free to reach out via birjob.com.