How I Scrape 91 Websites Every Day Without Getting Blocked
Published on BirJob.com · March 2026 · by the solo developer behind BirJob
Introduction: Why 91 Sites Is Harder Than It Sounds
BirJob.com is Azerbaijan's job aggregator. The idea is simple: instead of checking Kapital Bank's careers page, then SOCAR's, then Azercell's, then twenty government portals, then a handful of international tech boards — you just check one place. One search box, all the jobs.
When I started, I figured scraping job listings would be the easy part. Parse some HTML, store it in a database, done. I was wrong in almost every direction.
Today BirJob pulls from 91 sources. That number isn't marketing — it's the literal count
of .py files in scraper/sources/. Every morning at 6 AM, 9 AM, and
1 PM UTC, GitHub Actions fires up a Docker container, runs all of them concurrently (with a
semaphore cap of 2 in CI), and writes whatever comes back into a Postgres database. The whole
thing typically finishes in 15–20 minutes.
The 91 number hides a lot of pain. Some of those scrapers have been rewritten three or four
times. A non-trivial number of them are in the DISABLED_SCRAPERS graveyard —
they exist, they compile, and they will never run again until someone fixes whatever broke them.
This article is about what actually happens when you try to aggregate job listings across a
country's entire internet at scale, as one person, with no budget.
The Architecture: One File, One Source
The rule is simple: each source gets its own file in scraper/sources/. There is
no monolith that tries to handle multiple sites. The file for Kapital Bank is
kapitalbank.py. The file for SOCAR Downstream is
socardownstream_az.py. Each file contains exactly one class that extends
BaseScraper and exactly one public async method named either
scrape_* or parse_*.
This pattern has served me well. When Azercell changes their API, I open
azercell.py and fix it. Nothing else is affected. When I add a new source, I copy
an existing file, change the URL and parsing logic, and the manager picks it up automatically.
No registration, no config file update, no import list to maintain.
BaseScraper
BaseScraper in scraper/base_scraper.py does three things:
- Loads database credentials from DATABASE_URL (or falls back to individual env vars for backward compatibility).
- Provides fetch_url_async, an async HTTP helper with User-Agent rotation, retry logic, exponential backoff, and encoding detection.
- Provides save_to_db, an upsert method that writes jobs to Postgres and handles deduplication, soft-deletes, and source tracking.
Every individual scraper inherits these. A minimal scraper looks like this:
from base_scraper import BaseScraper, scraper_error_handler
import pandas as pd

class KapitalbankScraper(BaseScraper):
    @scraper_error_handler
    async def parse_kapitalbank(self, session):
        url = "https://apihr.kapitalbank.az/api/Vacancy/vacancies?Skip=0&Take=150"
        response = await self.fetch_url_async(url, session)
        if response:
            data = response.get('data', [])
            jobs = []
            for job in data:
                job_id = job.get('id')
                jobs.append({
                    'company': 'Kapital Bank',
                    'vacancy': job['header'],
                    'apply_link': f"https://hr.kapitalbank.az/vacancies/{job_id}"
                })
            return pd.DataFrame(jobs)
        return pd.DataFrame(columns=['company', 'vacancy', 'apply_link'])
That is the entire Kapital Bank scraper. 20 lines of actual logic. The heavy lifting (retries, headers, database writes, error boundaries) is all in the base class.
ScraperManager
ScraperManager in scraper/scraper_manager.py dynamically loads every
file in sources/ that is not in the disabled list:
scraper_files = [
    f.stem for f in sources_dir.glob('*.py')
    if f.stem != '__init__' and f.stem not in self.DISABLED_SCRAPERS
]

for scraper_file in scraper_files:
    module = importlib.import_module(f'sources.{scraper_file}')
    for attr_name in dir(module):
        attr = getattr(module, attr_name)
        if (isinstance(attr, type) and
                issubclass(attr, BaseScraper) and
                attr != BaseScraper):
            self.scrapers[scraper_file] = attr
            break
It then runs all of them concurrently using an asyncio.Semaphore, collects
results, and calls save_to_db on the combined DataFrame. The whole orchestration
is async end to end.
The @scraper_error_handler decorator
Every public method on every scraper is decorated with @scraper_error_handler.
Here is the actual implementation:
def scraper_error_handler(func: Callable) -> Callable:
    @wraps(func)
    async def wrapper(self, *args, **kwargs):
        try:
            return await func(self, *args, **kwargs)
        except BaseException as e:
            error_msg = f"Error in {func.__name__}: {str(e)}"
            logger.error(error_msg)
            if os.getenv('GITHUB_ACTIONS') == 'true':
                empty_df = pd.DataFrame(columns=['company', 'vacancy', 'apply_link'])
                empty_df.attrs['scraper_error'] = {
                    'error_type': type(e).__name__,
                    'error_message': str(e),
                    'function_name': func.__name__
                }
                return empty_df
            return pd.DataFrame(columns=['company', 'vacancy', 'apply_link'])
    return wrapper
Notice it catches BaseException, not just Exception. That was
intentional after I hit a case where a scraper raised a KeyboardInterrupt-like
signal in a subprocess context and killed the entire run. Now nothing escapes.
When running on GitHub Actions, the decorator also attaches error metadata to the returned
DataFrame's attrs dict. The manager checks for this marker to distinguish
"scraper found zero jobs legitimately" from "scraper crashed with an exception". Both return
an empty DataFrame, but only one is a bug.
Four Strategies for Modern Websites
Over time I've settled into a mental model with four tiers, ordered by preference:
- BeautifulSoup on static HTML (fast, cheap, reliable when it works)
- JSON API discovery via DevTools (fast, cheap, much more stable than HTML scraping)
- __NEXT_DATA__ extraction (elegant hack for Next.js government portals)
- Playwright (last resort: slow, brittle, expensive in CI)
Strategy 1: BeautifulSoup on Static HTML
The majority of the 91 scrapers use this. Fetch a URL, parse the HTML, pull out title and link. Most Azerbaijani company career pages are server-rendered: they generate HTML on the backend and send it to the browser as-is. BeautifulSoup handles those easily.
The workflow is the same every time. Open DevTools in Chrome. Load the careers page. Right-click a job title and click "Inspect". Find the outermost container that repeats for each job. Identify what uniquely identifies it — a class name, a tag, an attribute pattern. Write the selector.
KPMG Azerbaijan is a clean example of this. Their careers page at
kpmg.com/az/en/home/careers/our-vacancies.html renders job listings as
<h6> elements containing anchor links. The scraper:
h6_elements = soup.find_all('h6')
for h6 in h6_elements:
    link = h6.find('a', href=True)
    if not link:
        continue
    title = h6.get_text(strip=True)
    href = link.get('href', '')
    if title and href:
        jobs.append({
            'company': 'KPMG Azerbaijan',
            'vacancy': title,
            'apply_link': href
        })
The problem: this structure can change without warning. KPMG's page used to be a table in a
div.bodytext-data container. Then it became an h6 list. The scraper
now has three fallback strategies stacked: try the h6 structure first, then the
cmp-text-list__item list, then the legacy table. This is common across many scrapers.
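The stacking itself is just a loop over parsers, ordered newest-layout-first. A minimal sketch of the pattern (the strategy names and per-layout parser functions here are illustrative, not the actual names in kpmg.py):

```python
def parse_with_fallbacks(html, strategies):
    """Try each (name, parser) pair in order; return the first non-empty result.
    A selector that no longer matches should fall through, not crash the run."""
    for name, parse in strategies:
        try:
            jobs = parse(html)
        except Exception:
            continue  # broken selector: move on to the next layout
        if jobs:
            return name, jobs
    return None, []
```

Wired up as something like `parse_with_fallbacks(html, [('h6', parse_h6), ('list', parse_cmp_list), ('table', parse_legacy_table)])`. Returning the winning strategy's name is useful in logs: the day 'table' starts winning again, you know the site changed.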
Strategy 2: JSON APIs Found via DevTools Network Tab
This is the best outcome. When you find that a site is actually calling a JSON API under the hood, you can bypass the HTML entirely. The data is cleaner, the parsing is trivial, and the endpoint tends to be stable even when the frontend redesigns every six months.
Kapital Bank is the cleanest example. Their HR portal calls a public API endpoint:
https://apihr.kapitalbank.az/api/Vacancy/vacancies?Skip=0&Take=150&SortField=id&OrderBy=true.
That URL returns a JSON object with a data array. Each element has id,
header, location, employmentType, and deadLine.
The scraper is 20 lines long and has never broken.
SOCAR Downstream is slightly more interesting. Their vacancies page makes an AJAX POST to
/lazy_load_vacancies/1 with a page number and CSRF token in the body. The
response is a JSON object where files_html is an array of HTML strings
— each string being the rendered HTML for one vacancy card. You have to parse the HTML
from the JSON. Weird, but once you know the pattern it is completely reliable.
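A stdlib-only sketch of that two-step parse. The files_html key is the real one from the SOCAR response; the fragment parsing here uses html.parser for self-containment, while the actual scraper can use BeautifulSoup:

```python
import json
from html.parser import HTMLParser

class CardParser(HTMLParser):
    """Collect (href, text) pairs for every anchor in one HTML fragment."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

def parse_lazy_load_response(raw: str):
    """files_html is an array of rendered vacancy cards; parse each one."""
    jobs = []
    for fragment in json.loads(raw).get('files_html', []):
        parser = CardParser()
        parser.feed(fragment)
        jobs += [{'vacancy': text, 'apply_link': href}
                 for href, text in parser.links if text]
    return jobs
```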
Azercell was a bug that turned into a lesson. Their careers site at
azercell.easyhire.me is paginated, and the URL to get JSON data is:
https://azercell.easyhire.me/job/search?json=true&page=0
The key parameter is ?json=true, not ?json. When I first wrote
the scraper I used ?json (without the value), and the server just returned HTML.
I spent an embarrassingly long time trying to parse the HTML before I caught the difference
in the network tab. The ?json=true form returns a clean JSON object with a
jobs array and a count total for pagination.
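A cheap guard makes this class of bug loud instead of silent: before parsing, check whether the body is actually JSON and treat HTML as a failure. This is a hypothetical helper, not code from the repo:

```python
import json

def parse_maybe_json(text: str):
    """Parse a response body as JSON; return None if the server sent HTML.
    Returning None forces the caller to log a failure instead of silently
    trying to parse the wrong format."""
    stripped = text.lstrip()
    if stripped.startswith('<'):
        return None  # HTML came back: wrong query param, redirect, or challenge page
    try:
        return json.loads(stripped)
    except json.JSONDecodeError:
        return None
```

With this in place, the ?json vs ?json=true mistake would have shown up as an explicit "expected JSON, got HTML" failure on the first run.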
When hunting for these hidden APIs, I look for a few patterns in the network tab:
- XHR or Fetch requests to /api/, /v1/, /v2/
- Requests with Accept: application/json headers
- POST requests to /graphql (which is its own mess, covered below)
- URL parameters like ?format=json, ?json=1, ?output=json
- Requests to third-party ATS providers: Huntflow, EasyHire, Oracle HCM, SAP SuccessFactors
Strategy 3: __NEXT_DATA__ Extraction
This is my favourite approach for government portals built on Next.js. Every Next.js page
embeds a <script id="__NEXT_DATA__" type="application/json"> tag in
the HTML containing the full server-side props that were used to render the page. This
includes the actual data — in the case of a vacancies page, the full list of jobs.
TABIB (the state healthcare management union) runs a Next.js site. The vacancies page
at tabib.gov.az/vetendashlar-ucun/vakansiyalar previously had CSS class
names like vacanycard__title__3fH9k — hashed identifiers generated by
CSS Modules that change every time the site is rebuilt. My original scraper broke after
their first post-launch deployment. I fixed it by ignoring the CSS entirely:
soup = BeautifulSoup(response, 'html.parser')
script_tag = soup.find('script', id='__NEXT_DATA__')
if script_tag:
    next_data = json.loads(script_tag.string)
    vacancies = (
        next_data.get('props', {})
        .get('pageProps', {})
        .get('vacancies', [])
    )
    for v in vacancies:
        title = v.get('name') or v.get('title') or v.get('positionName') or 'N/A'
        slug = v.get('slug') or v.get('url') or v.get('id', '')
        apply_link = urljoin(base_url, f"/vetendashlar-ucun/vakansiyalar/{slug}")
        jobs.append({'company': 'TABIB', 'vacancy': title, 'apply_link': apply_link})
The CSS classes can change on every deployment. __NEXT_DATA__ almost never
changes structure unless the developer explicitly changes their data model. It is a much
more stable anchor than class names.
The pattern works on any Next.js site. You fetch the page HTML, find the script tag by
its id, parse the JSON, and navigate the props.pageProps tree to find your data.
The exact path varies by site, but the outer container is always the same.
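A minimal stdlib sketch of the generic version. A regex is enough to show the shape (a production version should locate the tag with a real HTML parser), and dig is a hypothetical helper for walking the props tree safely:

```python
import json
import re

# Pull the __NEXT_DATA__ payload out of raw page HTML.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)

def extract_next_data(html: str) -> dict:
    m = NEXT_DATA_RE.search(html)
    return json.loads(m.group(1)) if m else {}

def dig(data, *path, default=None):
    """Walk a props.pageProps-style nested dict without raising."""
    for key in path:
        if not isinstance(data, dict):
            return default
        data = data.get(key, default)
    return data
```

Usage looks like `dig(extract_next_data(html), 'props', 'pageProps', 'vacancies', default=[])`; only the final path segment changes from site to site.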
Strategy 4: Playwright (The Nuclear Option)
When nothing else works, I reach for Playwright. It launches a real Chromium browser, navigates to the page, waits for JavaScript to execute, and then lets you extract content from the fully-rendered DOM.
The tradeoffs are severe:
- A Playwright scrape takes 20–60 seconds per site; a fetch_url_async call takes 1–3 seconds.
- Playwright requires a full Chromium install inside the Docker container, adding ~300MB to the image.
- It fails in ways that are hard to diagnose: browser launch timeout, selector not found, page navigation error, out-of-memory in a container.
- Sites that block scrapers based on browser fingerprinting are even better at detecting headless Chromium than they are at detecting aiohttp.
Currently only two scrapers use Playwright in production: busy.py (the
busy.az job board, which is a Next.js SPA with no server-rendered HTML) and
the Playwright fallback path in mckinsey.py (which rarely succeeds and is
effectively disabled).
The busy.az Playwright scraper looks like this in simplified form:
async with async_playwright() as p:
    browser = await p.chromium.launch(
        headless=True,
        args=['--no-sandbox', '--disable-setuid-sandbox']
    )
    context = await browser.new_context(
        viewport={'width': 1920, 'height': 1080},
        user_agent='Mozilla/5.0 ...'
    )
    page = await context.new_page()
    for page_num in range(1, 4):
        url = f"https://busy.az/vacancies?page={page_num}"
        await page.goto(url, wait_until='domcontentloaded', timeout=45000)
        try:
            await page.wait_for_selector('a[href*="/vacancies/"]', timeout=20000)
        except PlaywrightTimeout:
            await asyncio.sleep(3)
        content = await page.content()
        # ... parse with BeautifulSoup
I use --no-sandbox because the container runs as root in GitHub Actions and
Chromium refuses to start otherwise. The wait_for_selector call with a timeout
fallback handles the case where the SPA loads but the jobs section takes longer than expected.
The Problems Nobody Warns You About
CSS Class Hashes That Change Every Deploy
Modern frontend tooling — Next.js, Create React App, Vite — generates CSS Modules
by default. Class names like .VacancyCard_title__3fH9k contain a hash of the
source file or its contents. When a developer rebuilds the site after any CSS change, the
hash changes. Your selector breaks.
The TABIB situation was the clearest example. My first version of that scraper used:
vacancy_cards = soup.find_all('div', class_=lambda c: c and 'vacanycard' in ' '.join(c).lower())
That worked for about two weeks. After TABIB pushed a new release, the class names were
completely different. The __NEXT_DATA__ rewrite solved it permanently.
The general defense is to never rely on a class name that looks like it contains a hash.
Prefer tag-based selectors, data-* attributes, structural patterns (nth-child,
parent-child relationships), or embedded JSON data. If you absolutely must use a class name,
use substring matching rather than exact matching so minor changes don't break you.
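One way to enforce this before a selector is ever written is a heuristic that flags hash-like class names. This sketch is my own rule of thumb, not code from the repo:

```python
def is_fragile_class(name: str) -> bool:
    """Flag CSS-Modules-style class names like 'VacancyCard_title__3fH9k'.
    Heuristic: a short alphanumeric segment containing a digit after '__'.
    Plain BEM names like 'vacancies__item' pass, because their suffix has
    no digits."""
    segment = name.rsplit('__', 1)[-1]
    return (
        segment != name                          # there was a '__' separator
        and 5 <= len(segment) <= 10
        and segment.isalnum()
        and any(ch.isdigit() for ch in segment)  # hashes mix letters and digits
    )
```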
GitHub Actions IPs Getting Blocked
GitHub Actions runners come from a well-known set of IP ranges that Microsoft/GitHub publishes publicly. Bot-detection systems — Cloudflare, Akamai, custom IP blocklists — know these ranges and block them.
Djinni.co is the most frustrating case. Locally, the djinni scraper works perfectly. It
fetches 30+ pages, returns hundreds of tech jobs from Ukrainian and Eastern European companies.
On GitHub Actions, it gets Errno 104 - Connection Reset by Peer on the first
request. Not a 403. Not a timeout. A TCP-level connection reset, which means the server
doesn't even want to complete the handshake.
I spent several hours trying to work around this. Rotating User-Agents didn't help —
the block is at the IP level. Adding realistic browser headers didn't help. Even adding
random delays between requests didn't help. The server simply refuses connections from
GitHub's IP space. Djinni is now in DISABLED_SCRAPERS.
hrcbaku.az has the same problem. Connection reset after ~464 seconds of trying,
which means GitHub Actions was burning almost 8 minutes per run on that scraper alone before
I disabled it.
The lesson: if a scraper works locally but consistently fails in CI with connection errors (not HTTP errors), it's almost certainly an IP block. There is no clean solution except using a residential proxy or running the scraper from your own server with a static IP.
Sites Changing Their HTML Structure Without Warning
I have no monitoring for "is this scraper's HTML selector still correct." The scrapers run, they either return data or they return zero rows, and I find out in the Telegram notification summary after the run.
The KPMG history is the clearest example of this. The site was originally a table:
<div class="bodytext-data">
  <table>
    <tbody>
      <tr>
        <td>Senior Associate</td>
        <td><a href="https://kpmgcca.global.huntflow.io/apply/...">Apply</a></td>
      </tr>
    </tbody>
  </table>
</div>
At some point KPMG redesigned their careers section. The table disappeared. Jobs were
now listed as <h6> elements with anchor links pointing to their Huntflow
ATS instead of the old PDF applications. The scraper returned zero for several days before
I noticed.
My current approach is to look at the zero-result list after each run and investigate anything that used to return jobs. If a scraper returns zero for more than three consecutive days, something is probably wrong. I haven't automated this check yet.
Cloudflare Challenges
Cloudflare's bot challenge works by serving a JavaScript puzzle instead of the actual page
content. Your scraper receives a 200 response containing the Cloudflare challenge HTML,
not the page you wanted. fetch_url_async considers this a success because the
status code was 200, and your scraper silently produces zero results because BeautifulSoup
can't find any job listings in a Cloudflare challenge page.
workly.az is currently blocked by Cloudflare on GitHub Actions. The site works
fine locally. In CI, every request gets the challenge page. There is no way around this
without either a real browser (Playwright) or a residential proxy. I disabled the scraper
rather than let it consume CI time returning zero.
The signature of a Cloudflare block in logs is: scraper completes in under 2 seconds,
returns zero jobs, no errors. Too fast and too clean. If you check the raw HTML response,
you will see something like <title>Just a moment...</title> and
references to challenge-platform.
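That signature is easy to check for mechanically. A small detector; the marker strings are taken from the challenge pages I've described plus common Cloudflare prefixes, and they are assumptions that may change as Cloudflare evolves:

```python
# Marker strings seen on Cloudflare challenge pages (assumptions, not a
# guaranteed-complete list).
CHALLENGE_MARKERS = ('just a moment', 'challenge-platform', 'cf-chl')

def looks_like_cloudflare_challenge(html: str) -> bool:
    """True if a 200 response body is a challenge page, not real content."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

Running this on the raw body of any suspiciously fast, zero-result scrape turns "silently returned nothing" into an explicit "blocked by Cloudflare" log line.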
GraphQL SPAs With No Documented API
boss.az is Azerbaijan's largest job board. It's built on Next.js with Apollo
GraphQL on the backend. Every page you see is rendered client-side, fed by GraphQL queries.
There is no server-rendered HTML with job listings.
I wrote a GraphQL scraper that tries three different query shapes against
https://boss.az/graphql:
query_attempts = [
    {
        "query": """
            query GetVacancies($page: Int, $perPage: Int) {
                vacancies(page: $page, perPage: $perPage) {
                    list {
                        id name positionName
                        profile { name logoUrl }
                    }
                    total
                }
            }
        """,
        "variables": {"page": 1, "perPage": 50}
    },
    # ... two more guesses
]
The endpoint responds, but none of the guessed schemas match the actual API schema. GraphQL returns structured error messages when the schema is wrong, so you know immediately that your query is invalid — but you don't know what the correct schema is without introspection, and Boss.az has introspection disabled. The scraper always returns zero and is now disabled.
The correct approach here would be to intercept the actual GraphQL queries the browser sends (via a proxy or Playwright's network interception API), capture the exact query format, and replay it. I haven't done this yet.
The DISABLED_SCRAPERS Graveyard
The ScraperManager maintains a set called DISABLED_SCRAPERS
(loaded from the database, with a hardcoded fallback):
DISABLED_SCRAPERS_FALLBACK = {
    "projobs_vacancies", "boss_az", "workly_az", "bfb", "djinni",
    "guavapay", "mckinsey", "its_gov", "isbu_az", "bp",
    "tabib_vacancies", "hrcbaku",
}
Each of these scrapers has a story.
projobs_vacancies — Dead API
ProJobs is an Azerbaijani job board. The original scraper targeted their documented REST API
at core.projobs.az/v1/vacancies. At some point that endpoint simply started
returning 404. The API version was probably deprecated when they rewrote their backend.
I wrote a new version that tries several API candidate URLs, but none of them work. Disabled
pending someone figuring out their current API.
boss_az — GraphQL Schema Unknown
Covered above. Three query attempts, three failures, zero jobs, always. The scraper code exists and is technically functional — it just cannot discover the correct schema without introspection access or source code.
bp — Pure Algolia/JavaScript
BP's careers page for Azerbaijan is a React application that loads job listings from Algolia search on the client side. The server sends HTML that contains empty job-container divs and a bundle of JavaScript. The JavaScript makes authenticated Algolia API calls using a short-lived key embedded in the bundle.
I found the Algolia application ID and search key (RF87OIMXXP,
55a63aab6a8a8b6be5266a69f9275540) by reading the page source. But Algolia
search keys are scoped — this key is for the main bp.com site index,
not the careers index. Searching it for Azerbaijan jobs returns zero results because the
careers data lives in a different index. The key was also likely rotated since I captured it.
The honest summary in the scraper's logs:
- Page loads successfully (200 OK)
- Algolia search system detected
- 19 UI containers found for dynamic job loading
- No jobs found in static HTML (expected for dynamic sites)
BP genuinely may not have open positions in Azerbaijan most of the time. But even when they do, we can't get at them without either Playwright or a reverse-engineered Algolia key.
mckinsey — Playwright Timeout at 140 Seconds
The McKinsey scraper tries their documented API at
mckapi.mckinsey.com/api/jobsearch?cities=Baku first. When that fails (which it
usually does in CI because the endpoint requires specific origin headers), it falls back to
Playwright.
The Playwright fallback navigates to
mckinsey.com/careers/search-jobs?cities=Baku, waits for
li.job-listing selectors to appear, and then extracts the content. On a real
browser on a fast connection, this takes 5–10 seconds. On GitHub Actions in a Docker
container, the navigation timeout is set to 90 seconds. The selector wait is 45 seconds.
That's 135 seconds of potential waiting before a single scraper gives up.
Even worse, McKinsey's site uses aggressive bot detection. The Playwright-rendered page often gets a consent wall or a bot challenge that prevents the job listings from ever loading. The scraper times out at ~140 seconds having collected nothing. Disabled.
djinni — GitHub Actions IPs Blocked
Already covered. Works locally, gets connection-reset in CI. Djinni is a popular platform for Eastern European tech jobs and would be valuable to include. I'm watching for a solution that doesn't involve paying for residential proxies.
workly_az — Cloudflare Blocks
Also covered. The scraper code is clean and correct. The site simply refuses to serve content to GitHub Actions IPs.
bfb — Genuinely No Listings
BFB's careers page has no job listings. It is a CV submission form. There is nothing to scrape. The file exists because someone asked me to add BFB as a source before I discovered this. Disabled, not broken.
isbu_az and its_gov — Structural Changes and Timeouts
isbu.az changed their CSS class structure at some point. The scraper's
primary selector (a.vacancies__item) no longer matches anything. There is a
fallback that tries to find any anchor with /vakansiya/ in the href, but the
site now has enough client-side rendering that this also returns nothing from the static HTML.
its.gov.az (the government IT agency) times out at ~192 seconds. The server
accepts the connection but never sends a complete response. This is either extremely slow
server-side rendering or the server is down but keeps connections open. Either way, it burns
3 minutes of CI time per run.
Making It Reliable
@scraper_error_handler — Blast Radius Containment
The fundamental guarantee that the system provides: one scraper crashing cannot affect
any other scraper. Without this, a single unhandled exception in the KPMG scraper would
propagate up through asyncio.gather and potentially terminate the entire run.
The decorator provides this guarantee by catching everything at the method level, before
the exception has a chance to reach the caller. The asyncio.gather in
ScraperManager.run_all_scrapers uses return_exceptions=True as
a second line of defense, but the decorator should make that unnecessary for well-formed
scrapers.
The BaseException catch is deliberate. Catching Exception does not cover SystemExit, KeyboardInterrupt, or GeneratorExit, which inherit directly from BaseException. In an async context with Playwright or subprocess calls, these can surface in unexpected places. BaseException catches everything.
fetch_url_async — The HTTP Layer
fetch_url_async in BaseScraper wraps aiohttp with
several layers of defence:
User-Agent rotation. There is a pool of 10 realistic User-Agent strings covering Chrome on Windows, Chrome on Mac, Chrome on Linux, Firefox on Windows, Firefox on Mac, Safari on Mac, and Edge on Windows. Each request picks one at random. This is not sophisticated bot evasion, but it prevents the most naive User-Agent-based filters.
Exponential backoff for rate limiting. On 403, 429, 503, 502, and 504 responses, the function waits and retries. In GitHub Actions mode:
wait_time = min((3 ** attempt) + random.uniform(2, 5), 20)
In local mode it uses a gentler 2 ** attempt base. Up to 3 retry attempts
are made before giving up.
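Extracted as a pure function, the delay schedule looks like this. The CI formula is the one shown above; the local variant's jitter range is my assumption:

```python
import random

def backoff_delay(attempt: int, in_ci: bool = True) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if in_ci:
        # Aggressive base, random jitter, hard 20-second cap.
        return min((3 ** attempt) + random.uniform(2, 5), 20)
    # Local runs use a gentler 2**attempt base; this jitter is an assumption.
    return (2 ** attempt) + random.uniform(0.5, 1.5)
```

In CI the schedule works out to roughly 3–6s, 5–8s, 11–14s, then a flat 20s cap, which keeps even the worst retry sequence under a minute per URL.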
Content-type aware decoding. If the response Content-Type is
application/json, the function calls response.json() and returns
a dict. Otherwise, it reads the bytes and decodes them, trying UTF-8 first and falling
back to chardet for automatic encoding detection. This matters for Azerbaijani
sites that sometimes serve content in Windows-1252 or ISO-8859-1.
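A stdlib-only sketch of that fallback chain. The real helper uses chardet for detection; here the candidate encodings are hardcoded, which is a simplification:

```python
def decode_response(raw: bytes) -> str:
    """Decode a response body, trying UTF-8 first, then legacy encodings
    some Azerbaijani sites still serve. Note iso-8859-1 maps every byte,
    so it acts as the terminal fallback."""
    for encoding in ('utf-8', 'windows-1252', 'iso-8859-1'):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode('utf-8', errors='replace')  # unreachable in practice
```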
Configurable timeouts. 60 seconds in CI, 45 seconds locally. Connect timeout of 20 seconds in both cases. These numbers came from experience: some government portals are genuinely slow and need more than the default 30-second total timeout.
Concurrency Management
Running 91 scrapers fully concurrently would exhaust the connection pool and likely get
several sites rate-limited simultaneously. The semaphore in run_all_scrapers
limits active scrapers:
if self.is_github_actions:
    max_concurrent = min(max_concurrent, 2)  # Reduce to 2 in CI

semaphore = asyncio.Semaphore(max_concurrent)

async def run_with_semaphore(scraper_name):
    async with semaphore:
        if self.is_github_actions:
            await asyncio.sleep(random.uniform(3, 8))
        else:
            await asyncio.sleep(random.uniform(0.5, 2))
        return await self.run_single_scraper(scraper_name, session)
The concurrency of 2 in CI was reached empirically. Higher values caused cascading failures: several scrapers would time out simultaneously, fill the connection pool, and trigger further timeouts. 2 is conservative but has been stable.
The random sleep before each scraper starts (3–8 seconds in CI) staggers the initial burst. Without it, all scrapers start simultaneously and hit their target sites at the same moment, which looks much more like a bot than staggered requests.
The aiohttp.TCPConnector is configured with a global limit of 100 connections
and 15 per host in CI:
connector = aiohttp.TCPConnector(
    limit=100,
    limit_per_host=15,
    ttl_dns_cache=300,
    use_dns_cache=True,
    keepalive_timeout=60,
    enable_cleanup_closed=True
)
DNS caching (300-second TTL) reduces lookup overhead on sites with multiple subdomains.
enable_cleanup_closed prevents file descriptor leaks over the course of a long
run.
The Upsert Pattern
The database write is not a simple insert. Each run performs a full upsert using
apply_link as the stable identifier:
INSERT INTO scraper.jobs_jobpost
    (title, company, apply_link, source, last_seen_at, is_active)
VALUES %s
ON CONFLICT (apply_link) DO UPDATE SET
    title = EXCLUDED.title,
    company = EXCLUDED.company,
    source = EXCLUDED.source,
    last_seen_at = NOW(),
    is_active = TRUE
RETURNING (xmax = 0) AS is_new
The (xmax = 0) AS is_new trick is a Postgres-specific way to determine
whether the ON CONFLICT branch ran or the INSERT branch ran.
xmax is 0 for newly inserted rows; for updated rows it contains the transaction
ID of the updating transaction. This gives me accurate new/updated counts without a
separate query.
After upserting, a second query soft-deletes jobs from this source that were not seen in the current run:
UPDATE scraper.jobs_jobpost
SET is_active = FALSE
WHERE source = %s
AND apply_link NOT IN %s
This means a job that was live yesterday but missing today gets flagged as inactive rather
than deleted. Useful for distinguishing "company filled the position" from "scraper had a
hiccup." The frontend only shows is_active = TRUE jobs.
Before writing to the database, the scraper also deduplicates by normalized title within
each run. The normalization strips parenthetical content
('Frontend Developer (React)' becomes 'frontend developer'),
removes location suffixes after a dash
('Senior Engineer - Bakı' becomes 'senior engineer'), and
collapses whitespace. This catches the case where the same job is listed twice on the same
site with slightly different formatting.
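The normalization can be expressed in three passes. A sketch that reproduces the examples above; the repo's exact rules may differ:

```python
import re

def normalize_title(title: str) -> str:
    """Normalize a vacancy title for within-run deduplication."""
    t = re.sub(r'\([^)]*\)', '', title)   # strip parenthetical content
    t = re.split(r'\s[-–—]\s', t)[0]      # drop location suffix after a dash
    return re.sub(r'\s+', ' ', t).strip().lower()
```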
GitHub Actions as Free Infrastructure
The entire scraping pipeline runs on GitHub Actions. There is no server, no cron job on
a VPS, no Celery worker. The workflow file at
.github/workflows/scraper.yml schedules three runs per day:
on:
  schedule:
    - cron: '0 6,9,13 * * *'
That is 06:00, 09:00, and 13:00 UTC, which corresponds to 10:00, 13:00, and 17:00 in Azerbaijan (UTC+4). Three times daily is frequent enough to catch same-day job postings without hammering the target sites.
The workflow checks out the repository, builds a Docker image from
scraper/Dockerfile, and runs it with the database URL and Telegram credentials
injected as environment variables from GitHub Secrets.
The Docker build is the slowest step. Playwright requires a full Chromium install.
Installing Python packages from requirements.txt takes time. Without caching,
the build step alone took 6–8 minutes per run.
The solution is Docker layer caching via GitHub Actions cache:
- name: Cache Docker layers
  uses: actions/cache@v4
  with:
    path: /tmp/.buildx-cache
    key: ${{ runner.os }}-buildx-${{ hashFiles('scraper/requirements.txt') }}
    restore-keys: |
      ${{ runner.os }}-buildx-
The cache key is based on the hash of requirements.txt. When requirements
don't change (which is most runs), the full cached image layers are restored and the
build takes about 45 seconds. Only when I add or update a dependency does it rebuild from
scratch.
The workflow also uses the --cache-to / --cache-from flags with
mode=max:
docker buildx build \
  --cache-from=type=local,src=/tmp/.buildx-cache \
  --cache-to=type=local,dest=/tmp/.buildx-cache-new,mode=max \
  --load \
  -t birjob-scraper:latest \
  .
The mode=max caches all intermediate layers, not just the final image. This is
important because the Playwright install layer and the pip install layer are
in the middle of the Dockerfile, and without mode=max they would not be cached
individually.
The math on GitHub Actions minutes: three runs per day at ~20 minutes each is about 60 minutes per day, roughly 1,800 minutes per month. For a public repository, standard GitHub-hosted runners are free, so this costs nothing. A private repository's free tier is 2,000 minutes per month, which 1,800 would nearly exhaust; a couple of days of slow Playwright scrapers or failed retries would blow through the quota. I keep the repository public partly for this reason.
Monitoring: Telegram After Every Run
After each run completes, a Telegram message goes to a private channel with a summary of what happened. The message includes:
- Total scrapers run vs. total successful
- Number of new jobs added to the database
- Number of jobs updated (seen again)
- Number of jobs deactivated (not seen in this run)
- Total jobs currently in the database
- List of scrapers that returned zero results
- List of scrapers that errored, with error types
- Total run duration
This is the primary way I notice when something breaks. If I see that
kapitalbank returned zero for the first time in months, I know to go check
their API. If I see a scraper listed under "errored" with an AttributeError,
the site probably changed structure.
The GitHub Actions workflow also emits structured annotations using the
::error title=...::message syntax. These appear inline in the Actions UI
with red warning icons next to the specific step. For critical failures this is useful,
but for routine "this scraper returned zero" events the Telegram notification is easier
to scan.
What I don't have: alerting when a previously-healthy scraper starts consistently returning zero. Right now I notice this manually after a few days. A proper monitoring setup would track historical counts per scraper and alert when a scraper that typically returns 50 jobs starts returning 0 for multiple consecutive runs. That's on the backlog.
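The core of that check is small enough to sketch now. Given a per-scraper list of daily job counts, alert when a scraper that used to return jobs flatlines. This is a hypothetical helper, not yet in the repo:

```python
def needs_attention(history, zero_streak=3):
    """history: daily job counts for one scraper, oldest first.
    True when a previously-healthy scraper has returned zero for the
    last `zero_streak` days."""
    if len(history) <= zero_streak:
        return False
    was_healthy = any(count > 0 for count in history[:-zero_streak])
    now_silent = all(count == 0 for count in history[-zero_streak:])
    return was_healthy and now_silent
```

The was_healthy condition matters: a scraper that has always returned zero (like bfb) should never alert, only one that degraded.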
What I'd Do Differently
After a year of running this, here is what I would change if I started over:
Use residential proxies from day one for the high-value blocked sites. Djinni, boss.az, and a few others are blocked at the IP level in GitHub Actions. I've been ignoring this problem because the proxy cost felt unjustifiable for a free service. But djinni alone would add hundreds of tech job listings that are currently invisible on BirJob. The ROI is probably positive if even a fraction of those users convert to paying customers.
Store raw HTML alongside structured data. Right now, when a scraper breaks because a site changed structure, I have no way to look at what the page looked like when it was last working. If I stored the raw response (even just for a few days), debugging structure changes would be much faster.
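A possible shape for that, assuming a local directory of gzipped snapshots with a short TTL (the directory name and retention period are arbitrary):

```python
import gzip
import time
from pathlib import Path

RAW_DIR = Path("raw_snapshots")   # hypothetical location
TTL_SECONDS = 7 * 24 * 3600       # keep snapshots for a week

def save_snapshot(scraper: str, body: str) -> Path:
    """Gzip the raw response so the last-working page is inspectable later."""
    RAW_DIR.mkdir(exist_ok=True)
    path = RAW_DIR / f"{scraper}-{int(time.time())}.html.gz"
    path.write_bytes(gzip.compress(body.encode("utf-8")))
    return path

def prune_snapshots(now=None) -> int:
    """Delete snapshots older than the TTL; return how many were removed."""
    now = now or time.time()
    removed = 0
    for f in RAW_DIR.glob("*.html.gz"):
        # Filename stem is "<scraper>-<unix_ts>.html"; take the timestamp.
        ts = int(f.stem.split("-")[-1].removesuffix(".html"))
        if now - ts > TTL_SECONDS:
            f.unlink()
            removed += 1
    return removed
```

Gzipped HTML is small; a week of snapshots for 91 sources is tens of megabytes, not a storage problem.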
Write integration tests that run against real URLs on a weekly schedule. The kind of tests that don't mock anything: actually fetch the page, actually parse it, verify that the result is a non-empty DataFrame with plausible data. This would catch structural changes within days instead of weeks.
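The live tests would share one plausibility check. The real pipeline works with DataFrames; this sketch uses plain dicts to stay dependency-free, and the required field names are assumptions:

```python
def assert_plausible(rows: list, min_rows: int = 1) -> None:
    """Shared check for live integration tests: non-empty result with
    the fields the pipeline expects and no blank titles."""
    required = {"title", "company", "apply_link"}   # assumed schema
    assert len(rows) >= min_rows, "scraper returned no rows"
    for row in rows:
        missing = required - row.keys()
        assert not missing, f"missing fields: {missing}"
        assert row["title"].strip(), "blank title suggests a broken selector"

# A weekly, unmocked test would then look something like (hypothetical):
# @pytest.mark.liveweb
# async def test_kapitalbank_live():
#     rows = await KapitalBankScraper().scrape_jobs()
#     assert_plausible(rows, min_rows=5)
```

Marking these tests (`liveweb` here) keeps them out of the normal CI run, where network flakiness would be noise.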
Build a proper health dashboard. The Telegram summary is useful but linear. What I want is a table showing, per scraper, the last 30 days of job counts. A scraper that was returning 40 jobs and now returns 5 is showing a warning sign even if it hasn't broken yet. This would help me catch degraded-but-not-zero scrapers, which are currently invisible.
Never use CSS class names that look like they contain hashes. I knew this was fragile from the start but built a few scrapers that way anyway because it was faster. Every one of them has broken at least once. Structural selectors (nth-child, known tags, data attributes) are more work upfront but far more stable.
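This can even be linted for. A rough heuristic, entirely my own and with false positives, that flags class names containing a random-looking letter-digit token:

```python
import re

def looks_generated(class_name: str) -> bool:
    """Heuristic: CSS-in-JS bundlers emit class names containing a
    random-looking token ("jobCard__x7f3ab", "css-1q2w3e") that changes
    on the next deploy, so selectors built on them are fragile."""
    for token in re.split(r"[_\-]+", class_name):
        if len(token) >= 5 and re.search(r"\d", token) and re.search(r"[a-z]", token, re.I):
            return True
    return False

# Prefer structural selectors that survive restyling, e.g. (illustrative):
#   "ul.vacancies > li a[href*='/vacancy/']"
```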
Consider GraphQL introspection early. When I encounter a GraphQL site now, I try to run an introspection query first to learn the schema. Many GraphQL endpoints have introspection disabled in production but enabled in staging, or have it disabled but ship the schema definition in their JavaScript bundle. Ten minutes of reading bundle output would have saved the hours I spent guessing boss.az query shapes.
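The ten-minute check looks like this: POST an abridged version of the standard introspection query and see whether the endpoint describes itself. The endpoint URL is whatever the site's network tab shows:

```python
import json
import urllib.request

# Abridged introspection query: asks the endpoint to describe its own
# query type and the fields (with arguments) of every type.
INTROSPECTION_QUERY = """
query {
  __schema {
    queryType { name }
    types {
      name
      fields { name args { name type { name } } }
    }
  }
}
"""

def introspect(endpoint: str) -> dict:
    """POST the introspection query and return the decoded JSON.

    If introspection is disabled, the response typically contains an
    "errors" key instead of "data" -- that itself is useful signal.
    """
    req = urllib.request.Request(
        endpoint,
        data=json.dumps({"query": INTROSPECTION_QUERY}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=15) as resp:
        return json.load(resp)
```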
Use the __NEXT_DATA__ approach by default for any Next.js site.
I still sometimes try CSS-based selectors first on Next.js sites out of habit. I should
flip this: always check for __NEXT_DATA__ first, and only fall back to
CSS selectors if there is no useful data in the JSON. The JSON path is always more stable.
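Checking for `__NEXT_DATA__` is a few lines: Next.js server-renders the page state into a `<script id="__NEXT_DATA__" type="application/json">` tag. A minimal extractor (the regex tolerates extra attributes on the tag; the path to the listings inside the JSON varies per site):

```python
import json
import re

def extract_next_data(html: str) -> dict:
    """Pull the JSON blob that Next.js embeds in every server-rendered page."""
    m = re.search(
        r'<script[^>]+id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html,
        re.S,
    )
    if not m:
        raise ValueError("no __NEXT_DATA__ script tag; not a Next.js page?")
    return json.loads(m.group(1))

# Listings usually live somewhere under props.pageProps, e.g.
# data["props"]["pageProps"]["vacancies"] -- the exact key is per-site.
```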
Rate-limit per domain, not globally. The current semaphore caps total concurrent scrapers at 2, which does guarantee no single site ever sees more than 2 simultaneous requests. But the politeness concern is per-site, and the cap is global: only two scrapers can run at once even when they target completely different domains, so one slow scraper blocks a slot that a fast scraper for an unrelated site could be using. A per-domain rate limiter would let the fast scrapers finish quickly while the slow ones churn in the background.
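The replacement is small: keep one `asyncio.Semaphore` per domain instead of one global one. A sketch of what that could look like (the class and its defaults are my own, not BirJob's current code):

```python
import asyncio
from urllib.parse import urlparse

class PerDomainLimiter:
    """One semaphore per domain: different sites scrape in parallel,
    while any single site sees at most `per_domain` concurrent requests."""

    def __init__(self, per_domain: int = 2):
        self.per_domain = per_domain
        self._sems: dict[str, asyncio.Semaphore] = {}

    def for_url(self, url: str) -> asyncio.Semaphore:
        """Return the (lazily created) semaphore for this URL's domain."""
        domain = urlparse(url).netloc
        if domain not in self._sems:
            self._sems[domain] = asyncio.Semaphore(self.per_domain)
        return self._sems[domain]

# Usage inside a scraper (sketch):
#   async with limiter.for_url(url):
#       resp = await session.get(url)
```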
Conclusion
At 91 sources, the system works well enough. The morning run starts at 06:00 UTC and most users see fresh data by the time they sit down at their desks in Baku. Companies like Kapital Bank, SOCAR, Azercell, Azerconnect, and dozens of others are checked multiple times per day. New jobs typically appear on BirJob within a few hours of being posted.
The architecture is simple: one file per source, a shared base class, a decorator that contains failures, and a manager that orchestrates everything. The complexity lives in the individual scrapers, not in the framework. Adding a new source is genuinely 20–30 minutes of work. Fixing a broken scraper is usually faster than that.
What makes this kind of project interesting to maintain is that the internet doesn't sit still.
Sites redesign, APIs change, bot detection improves, domains go offline, companies switch ATS
providers. The DISABLED_SCRAPERS graveyard is a record of all the ways the open
web fights back. Most of those battles are losing ones — if a site with Cloudflare
protection doesn't want to be scraped from a cloud IP, there isn't much I can do without paying
for infrastructure.
But the majority of Azerbaijan's job market is accessible: company career pages, government portals, regional job boards, and local startups. Most of them don't have aggressive bot protection because they don't need to. They want their listings to be visible. BirJob is in the business of helping people find those listings. For that job, 91 scrapers running three times a day is good enough.
BirJob.com is Azerbaijan's job aggregator. If you're a company that wants your listings included, or if you've noticed your site is missing, feel free to reach out via birjob.com.