I used AI to help curate and structure my findings in this article — but every decision, every scraped website, every late-night debugging session described here is real. This is my story.
There's a particular kind of frustration that Azerbaijani job seekers know well.
You open hellojob.az. Scroll through. Nothing great. Switch to vakansiya.az. Scroll again. Maybe try banker.az — wait, is that only for finance? Back to Google. Click three more sites. By this point you have seven tabs open and can't remember which job you already saw on which site.
I got tired of watching this happen. So one weekend I thought: what if I just aggregated all of them?
That was the beginning of BirJob. What followed was six months of building in the margins of my life — evenings, weekends, the odd long commute. No funding. No team. No budget. Just a problem that annoyed me and enough stubbornness to do something about it.
Here's how the whole thing is built, what it cost (spoiler: nothing to start), and what I learned along the way.
First, the Numbers
Before we get into the weeds — here's where BirJob stands today:
- 50+ sources scraped automatically, every day
- 4,000+ active vacancies at any moment
- 3 scrape runs per day — 10:00, 13:00, 17:00 Baku time
- ~$45/month in running costs after free tiers ran out
- $0 to validate the entire concept
That last point matters. I didn't spend a cent until I knew people were actually using it.
The Tech Stack (all of it)
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Next.js 14 (App Router) | SSR, ISR, API routes |
| Styling | Tailwind CSS | Utility-first CSS |
| ORM | Prisma | Type-safe DB client |
| Database | Neon PostgreSQL | Serverless Postgres |
| Hosting | Vercel | Edge deployment |
| File Storage | Cloudflare R2 | CV + image storage |
| Email | Resend | Transactional email |
| Payments | Epoint | Azerbaijani payment gateway |
| Auth | JWT (custom) | Session management |
| Scraper | Python + aiohttp | Async HTTP scraping |
| HTML parsing | BeautifulSoup4 | Static HTML parsing |
| Data processing | pandas | DataFrame operations |
| Browser automation | Playwright | SPA scraping |
| Scraper runtime | Docker + GitHub Actions | Free scheduled CI jobs |
| Internal analytics | PostgreSQL (custom tables) | Event + search logs |
| External analytics | Google Analytics 4 + GA4 Data API | Traffic reporting |
| Notifications | Telegram Bot API | Job alerts + admin pings |
The Architecture (in one picture)
I wanted to build this in a way where if the scraper exploded at 3am, the website would still be up. And if I pushed a broken frontend deploy, the scraper would keep running anyway. Complete decoupling.
```
┌──────────────────────────────────────────────────────────┐
│                     GitHub Actions                       │
│         Python scraper (Docker) fires 3x/day             │
│      → writes to Neon PostgreSQL (scraper schema)        │
└──────────────────────────────────────────────────────────┘
                          │
                          ▼  (shared database, nothing else)
┌──────────────────────────────────────────────────────────┐
│                 Next.js 14 on Vercel                     │
│       Server Components, Prisma ORM, API routes          │
│              → Cloudflare R2 for files                   │
│              → Resend for email                          │
│              → GA4 Data API in admin dashboard           │
└──────────────────────────────────────────────────────────┘
```
The scraper doesn't know Next.js exists. The frontend doesn't know how scraping works. They share exactly one thing: a Postgres database. That's it. This turned out to be the most important architectural call I made.
Part 1: The Scraper
"One file, one source" — and why it saved me dozens of times
I have 91 Python files in scraper/sources/. Each one is a single class that knows how to scrape one website. Kapital Bank has its own file. Hellojob has its own file. SOCAR Downstream has its own file.
Every class extends BaseScraper and looks roughly like this:
```python
class KapitalBankScraper(BaseScraper):
    @scraper_error_handler
    async def scrape_kapitalbank(self, session):
        html = await self.fetch_url_async(
            "https://www.kapitalbank.az/careers", session
        )
        soup = BeautifulSoup(html, "html.parser")
        # parse the page, collect jobs
        return pd.DataFrame(rows, columns=["company", "vacancy", "apply_link"])
```
Three things make this pattern work:
@scraper_error_handler is a decorator that catches everything — any exception, any edge case — and returns an empty DataFrame instead of crashing. If Kapital Bank's website goes down at 2pm, the other 90 scrapers keep running like nothing happened. The morning I added this decorator was the morning I stopped getting 3am error notifications.
fetch_url_async is the shared HTTP method that handles all the annoying real-world stuff: rotating User-Agent strings, exponential backoff with jitter on failures, different timeout profiles for local vs. GitHub Actions, and auto-detecting whether the response is JSON or HTML.
pandas.DataFrame — every single scraper returns the same three columns. Combining 91 scrapers is just pd.concat(). One line.
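A minimal sketch of what that decorator can look like (the body here is my reconstruction; the real version presumably also records failures to the scraper_errors table):

```python
import asyncio
import functools
import logging

import pandas as pd

# Every scraper returns these three columns, even on failure.
COLUMNS = ["company", "vacancy", "apply_link"]

def scraper_error_handler(func):
    """Swallow any exception from a scraper coroutine and return an
    empty DataFrame so the other sources keep running."""
    @functools.wraps(func)
    async def wrapper(self, *args, **kwargs):
        try:
            return await func(self, *args, **kwargs)
        except Exception as exc:
            logging.warning("scraper %s failed: %s", func.__name__, exc)
            return pd.DataFrame(columns=COLUMNS)
    return wrapper

class BrokenScraper:
    @scraper_error_handler
    async def scrape(self, session=None):
        raise RuntimeError("site is down")

# The failure is absorbed; we still get a well-formed (empty) DataFrame.
df = asyncio.run(BrokenScraper().scrape())
```

Because even a failed scraper returns the standard columns, the final `pd.concat()` over all sources never breaks.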
The scrapers I had to give up on
Not every site plays nice. I keep a DISABLED_SCRAPERS set for sources that are broken in ways I can't easily fix:
```python
DISABLED_SCRAPERS = {
    "djinni",    # GitHub Actions IPs get blocked — returns 403
    "boss_az",   # GraphQL query format unknown — always 0 results
    "bp",        # Pure Algolia JS, no static content anywhere
    "mckinsey",  # Playwright times out at ~140 seconds
    ...
}
```
The bp one stings a little. BP Azerbaijan is one of the biggest employers in the country. But their careers page is 100% client-side JavaScript with no scrape-able static content. Some sites just don't want to be scraped, and that's fine.
Four ways I handle modern websites
BeautifulSoup on static HTML — covers the majority of sites. Fast, reliable, easy to debug.
JSON APIs — some companies' career pages hit an internal API. When the response is Content-Type: application/json, it gets auto-parsed into a dict. Some of these APIs are documented nowhere; I found them by opening DevTools and watching the network tab. Azercell, for instance, just needs a ?json=true query parameter that nobody told anyone about.
__NEXT_DATA__ extraction — this one I'm particularly proud of. Several corporate sites are Next.js SPAs. Running Playwright for all of them would be slow and fragile. But Next.js bakes all the server-rendered page data into a <script id="__NEXT_DATA__"> tag in the HTML. Parsing that JSON is instant, requires no browser, and gives you everything. It's the lazy person's way to scrape a Next.js site, and it works beautifully.
Playwright — the nuclear option. A real headless Chromium browser for sites that are genuinely fully dynamic. I use it sparingly. When a Playwright scraper starts timing out consistently, it joins DISABLED_SCRAPERS.
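The __NEXT_DATA__ trick can be sketched with nothing but the standard library. A regex pulls the script tag out of the raw HTML (the production scrapers use BeautifulSoup, but the idea is identical):

```python
import json
import re

def extract_next_data(html: str) -> dict:
    """Pull the server-rendered JSON payload out of a Next.js page."""
    match = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(match.group(1)) if match else {}

# A toy page resembling what a Next.js careers site ships to the browser.
html = (
    '<html><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"jobs": [{"title": "DevOps Engineer"}]}}}'
    '</script></html>'
)
data = extract_next_data(html)
```

No browser, no JavaScript execution, and the payload usually contains more fields than the rendered page even shows.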
GitHub Actions as a free cron scheduler
The whole scraper runs inside Docker, triggered by GitHub Actions three times a day:
```yaml
on:
  schedule:
    - cron: '0 6,9,13 * * *'  # 10:00, 13:00, 17:00 Baku time (UTC+4)
```
The trick that makes this practical is Docker layer caching. Installing all the Python dependencies takes 3–4 minutes from scratch. With actions/cache caching the build layers between runs, it's under 30 seconds.
Three runs a day at ~8 minutes each is about 24 minutes/day, or ~720 minutes/month. The free plan gives 2,000 minutes for private repos. I use 36% of it.
Documented limits (GitHub Actions billing docs): Public repos — unlimited free minutes. Private repos (Free plan) — 2,000 minutes/month, Linux at 1× multiplier. Storage — 500 MB. Concurrent jobs — 20. Overage — $0.008/minute for Linux.
After each run, I get a Telegram message:
🆕 142 new ✅ 891 updated 🙈 23 hidden
📦 4,234 total ⏱ 4m 12s
That's my morning coffee read.
Part 2: The Database
Two schemas, one connection string
I use Neon — serverless PostgreSQL. The "serverless" part matters: when nothing is querying the database (like at 3am), compute scales to zero and costs nothing. Cold start is ~500ms, which is fine for an async scraper and a frontend with connection pooling.
The database has two PostgreSQL schemas inside one Neon project:
```
neon_db
├── scraper (schema)        ← Python owns this
│   ├── jobs_jobpost
│   └── scraper_errors
└── website (schema)        ← Next.js owns this
    ├── users
    ├── sponsored_jobs
    ├── job_applications
    ├── blog_posts
    ├── web_event_logs
    ├── search_logs
    ├── email_broadcasts
    ├── telegram_subscribers
    └── supporters
```
When I change the website schema — add a new analytics table, alter the users model — it doesn't touch anything the scraper cares about. Migrations are safe to run at any time.
Documented limits (Neon pricing): Storage — 0.5 GiB on Free plan. Compute — 191.9 compute-hours/month (~5 active hours/day at 0.25 vCPU). Projects — 1. Branches — 10. No autoscaling on Free plan. The database currently sits at ~180 MB. Comfortable. The compute budget is the one to watch if traffic spikes.
The upsert that powers everything
Scraping the same job twice is unavoidable. I needed a smart upsert:
```sql
INSERT INTO scraper.jobs_jobpost
    (title, company, apply_link, source, last_seen_at, is_active)
VALUES %s
ON CONFLICT (apply_link) DO UPDATE SET
    title        = EXCLUDED.title,
    company      = EXCLUDED.company,
    last_seen_at = NOW(),
    is_active    = TRUE
RETURNING (xmax = 0) AS is_new;
```
The apply_link is the natural unique key. The RETURNING (xmax = 0) is a neat PostgreSQL trick: xmax is 0 on a row created by a plain insert, but nonzero on a row that went through the ON CONFLICT update path — so the flag is TRUE exactly for fresh inserts. That single boolean lets me count "new jobs" vs "jobs we already had" without an extra query.
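On the Python side, turning those flags into the run summary is just a tally. A sketch (the function name is mine; in practice this runs over the rows psycopg2 hands back from the batch upsert):

```python
def summarize_upsert(is_new_flags):
    """Tally RETURNING (xmax = 0) flags: True = fresh insert, False = update."""
    new = sum(1 for flag in is_new_flags if flag)
    return {"new": new, "updated": len(is_new_flags) - new}

# Five rows came back from one batch: two fresh inserts, three updates.
summary = summarize_upsert([True, False, False, True, False])
```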
After each upsert batch, I soft-delete anything not seen in the current scrape:
```sql
UPDATE scraper.jobs_jobpost
SET is_active = FALSE
WHERE source = %s
  AND apply_link NOT IN %s
```
The job never gets deleted. The page never 404s. It quietly disappears from search results, gets a noindex meta tag, and its Schema.org validThrough date is set to when it was last seen. Google understands the posting has expired. Nobody gets a broken link.
Part 3: The Frontend
Next.js 14 — because I didn't want to think about caching
Server Components let me query Postgres directly in the component and ship fully-rendered HTML to the browser. No loading state, no skeleton screens for content that never changes per-user.
The homepage gets cached and revalidated every 10 minutes. Blog posts every hour. A thousand simultaneous homepage visits trigger exactly one database query every 10 minutes.
```ts
export const revalidate = 600;  // homepage: 10 minutes
export const revalidate = 3600; // blog posts: 1 hour
```
Documented limits (Vercel pricing): Bandwidth — 100 GB/month on Hobby. Build execution — 6,000 minutes/month. Serverless function execution — 100 GB-hours/month. Image optimizations — 1,000/month on Hobby.
⚠️ The gotcha nobody warns you about: Vercel's Hobby plan Terms of Service restrict it to "non-commercial, personal projects." The moment BirJob started charging companies for sponsored job postings, it became commercial — which technically requires the Pro plan ($20/month). If you're building something that takes payments, plan for this from day one.
Auth in 80 lines of code
I wrote JWT-based auth from scratch. Sign a token, set it as an httpOnly cookie, verify it in middleware. That's it. 80 lines of TypeScript, zero monthly cost, complete control. Email unsubscribe links use HMAC signatures — one-click opt-out, no login required, nobody can forge requests for someone else's address.
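The unsubscribe-link scheme fits in a few lines. A sketch in Python (the real implementation is TypeScript; the secret and function names here are hypothetical):

```python
import hashlib
import hmac

SECRET = b"server-side-secret"  # hypothetical; never leaves the server

def unsubscribe_token(email: str) -> str:
    """Sign the address so the one-click link works without a login."""
    return hmac.new(SECRET, email.lower().encode(), hashlib.sha256).hexdigest()

def verify_unsubscribe(email: str, token: str) -> bool:
    """Constant-time comparison, so forged tokens can't be probed byte by byte."""
    return hmac.compare_digest(unsubscribe_token(email), token)

token = unsubscribe_token("user@example.com")
```

Anyone can click the link for their own address; nobody can forge a valid token for someone else's without the server secret.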
SEO: the thing that actually drives traffic
A job aggregator lives or dies on organic search. A few things I do intentionally:
Every job has a proper page at /jobs/[id] with structured data, a canonical URL, and a title Google can index.
Schema.org JobPosting on every listing — active jobs get validThrough: today + 30 days. Inactive jobs get validThrough: last_seen_at (in the past), which tells Google the posting expired without triggering a 404.
The sitemap is dynamic — it queries Postgres on every regeneration. Fresh jobs get priority: 0.7, changeFrequency: daily. Older jobs get priority: 0.5, changeFrequency: weekly. Inactive jobs are excluded entirely.
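The validThrough logic above fits in one small function. A sketch of how the structured data might be assembled (the field selection is mine; the real pages emit more JobPosting properties):

```python
from datetime import date, timedelta

def job_posting_jsonld(title: str, company: str, is_active: bool,
                       last_seen: date) -> dict:
    """Active jobs stay valid 30 days out; expired ones get a date in the past."""
    valid_through = date.today() + timedelta(days=30) if is_active else last_seen
    return {
        "@context": "https://schema.org",
        "@type": "JobPosting",
        "title": title,
        "hiringOrganization": {"@type": "Organization", "name": company},
        "validThrough": valid_through.isoformat(),
    }

# An inactive job simply expires at the moment it was last seen.
expired = job_posting_jsonld("Backend Engineer", "Acme", False, date(2024, 1, 15))
```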
Analytics: I built my own instead of paying for Mixpanel
Every meaningful user action writes a row to web_event_logs. Next.js middleware sets a birjob_sid session cookie (32 hex chars, 24-hour TTL) on every request. This stitches together the anonymous visitor journey — first visit → search → register → apply — without storing any personal data.
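Generating that cookie value is one call to the standard library. A Python sketch (the real middleware is TypeScript, where `crypto.randomBytes(16).toString('hex')` does the same job):

```python
import secrets

def new_session_id() -> str:
    """32 hex characters, like the birjob_sid cookie value."""
    return secrets.token_hex(16)

sid = new_session_id()
```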
Search queries go to a dedicated search_logs table with a results_count column. One SQL query shows me every search that returned zero results. That table is my product roadmap. When "Azercosmos" appears 40 times with zero results, that's my next scraper to build.
On top of that, I integrated the GA4 Data API directly into the admin panel using a Google Cloud service account. The admin dashboard pulls both internal metrics (registrations, applications, revenue — from Postgres) and traffic metrics (sessions, bounce rate, top pages — from GA4) into one screen. No tab switching.
Documented limits (GA4 Data API quotas): 200,000 requests/day, 40,000/hour. Data freshness — 24–48 hour processing delay for standard reports. Data retention — configurable up to 14 months on free properties. For a dashboard refreshed a few times per day, these limits are effectively infinite.
Part 4: Everything Else
Email: Resend
Resend handles all outgoing email — verification, job alerts, password resets, HR notifications, admin broadcasts.
Documented limits (Resend pricing): 3,000 emails/month, 100/day on Free plan. 1 custom domain. API rate limit — 2 requests/second. No webhooks on Free plan. When you hit the daily cap, you stop sending — no overage charges, no surprises. Just silence, which is its own kind of surprise if you're not monitoring it.
Payments: Epoint
Companies pay to place sponsored job listings. I'm using Epoint — Azerbaijan's local payment gateway — because it works with Azerbaijani bank cards and doesn't charge a monthly platform fee. The integration is a webhook: Epoint POSTs to my callback endpoint, I verify the HMAC signature, and the job goes live. No Stripe. No $10–20/month platform fee. Zero cost until money comes in.
File Storage: Cloudflare R2
CV files and blog images live on Cloudflare R2. The reason I chose R2 over S3 is simple: R2 has zero egress fees. S3 charges $0.09/GB to read your own files. For a platform where every CV download is an outbound read, that math adds up fast.
Documented limits (Cloudflare R2 pricing): Storage — 10 GB/month free. Class A ops (writes) — 1M/month free. Class B ops (reads) — 10M/month free. Egress — $0 always, forever. Overage — $0.015/GB storage.
Telegram: the best free notification system that exists
Two uses: my personal admin feed after every scraper run, and @birjob_bot for user job alerts. Users subscribe with keywords. When a matching job is scraped, the bot sends a direct message. This is a real-time push notification system that costs exactly nothing and doesn't require a mobile app.
Documented limits (Telegram Bot FAQ): 30 messages/second globally, 1 message/second per chat. Completely free, no tiers, no quotas beyond rate limits. If the platform scales to where sending to thousands of users hits the 30/sec limit, I'll need a message queue. Good problem to have.
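Even before a real message queue, a tiny sliding-window limiter keeps a broadcast under the 30 messages/second global cap. A sketch (the class name and structure are mine):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Block just long enough to stay under `limit` sends per `window` seconds."""

    def __init__(self, limit: int = 30, window: float = 1.0):
        self.limit = limit
        self.window = window
        self.sent = deque()  # monotonic timestamps of recent sends

    def wait(self) -> None:
        now = time.monotonic()
        # Drop sends that have fallen out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) >= self.limit:
            # Sleep until the oldest send ages out of the window.
            time.sleep(self.window - (now - self.sent[0]))
        self.sent.append(time.monotonic())

# Toy demo with a small window: the third send has to wait.
limiter = SlidingWindowLimiter(limit=2, window=0.2)
start = time.monotonic()
for _ in range(3):
    limiter.wait()
elapsed = time.monotonic() - start
```

Call `limiter.wait()` before each `sendMessage`; the Telegram per-chat limit (1 message/second) would need a second, per-chat instance of the same idea.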
What It All Costs
| Service | Free Limit | After Free Tier |
|---|---|---|
| GitHub Actions | 2,000 min/month (private repos) | $0.008/min |
| Neon PostgreSQL | 0.5 GiB, 191.9 compute-hours/month | ~$19/month |
| Vercel | 100 GB bandwidth ⚠️ non-commercial only | ~$20/month (Pro) |
| Cloudflare R2 | 10 GB storage, 10M reads, $0 egress | $0.015/GB |
| Resend | 3,000 emails/month | $20/month (Pro) |
| GA4 | 200k API calls/day | Free forever |
| Telegram Bot API | 30 msg/sec, unlimited bots | Free forever |
| Epoint | No monthly fee | Per-transaction % |
To launch: $0. Running costs today: ~$45/month. Less than what most people spend on subscriptions they forgot about.
What I'd Do Differently
I'd plan for Vercel Pro from day one. The Hobby plan says "non-commercial" in the terms. I glossed over that. The moment you add a checkout flow, you're commercial. Budget $20/month for Vercel if your product makes money.
I'd add search_logs even earlier. I built it a few months in and immediately had better product insight than months of guessing. The zero-result searches alone changed what I worked on next.
I'd be more brutal about disabled scrapers. I spent too much time trying to make scrapers work that were fundamentally hostile to scraping. Some sites just don't want to be scraped. Move on.
I'd separate is_active and soft-delete from the start. Early on I deleted stale jobs. That created dead URLs, hurt SEO, and broke bookmarks. Soft-delete from day one would have saved me a painful migration.
What's Next
The search_logs table drives the roadmap. Every week I run one query:
```sql
SELECT query, COUNT(*) AS searches
FROM search_logs
WHERE results_count = 0
  AND created_at > NOW() - INTERVAL '7 days'
GROUP BY query
ORDER BY searches DESC
LIMIT 20;
```
Whatever appears at the top — that's what I build next.
The business model isn't complicated: companies pay to pin their job listings at the top of search results. The more organic traffic BirJob generates, the more valuable those sponsored slots become. SEO compounds quietly in the background while the scraper runs three times a day without me touching anything.
If you're thinking about building something similar — a niche aggregator, a local search engine, a vertical marketplace — my honest advice is this: don't let the infrastructure planning stop you from shipping. GitHub Actions, Neon, Vercel's Hobby plan, and Cloudflare R2 give you a production-grade stack for free. The 91 scrapers took time. The SEO took patience. The infrastructure took an afternoon.
Start with one source. Get it working. Ship it. Then add the next one.
