Dynamic Polling Scheduler: Reinventing Shipment Tracking at ClickPost
16 Sep, 2025 | 6 Min Read
Polling millions of shipments every few hours with a single cron job creates multiple problems—burst traffic that overwhelms courier APIs, no per-account control, and minimal visibility into failures. To solve this, we built the Dynamic Polling Scheduler, a continuous, account-aware system that spreads the load evenly, guarantees exactly-once message delivery, and provides granular observability. In this article, we’ll explain the limitations of the previous approach, the design of the new scheduler, and how it ensures reliable, scalable, and crash-safe shipment polling.
Burst traffic: The old 2-hour cron pulls ~10 M open shipments in one shot, pushes them to Kafka, and subjects courier partners to massive QPS spikes.
No per-account control: Every customer is polled on the same 2 h cadence, whether they need it or not.
Coarse observability: We only know something’s wrong every two hours.
Missed SLAs: Rate-limit blocks can delay polling beyond the 2 h window.
Move to a continuous, account-aware scheduler that:
Polls each shipment on its own interval (default = 120 min, configurable per account).
Spreads the load evenly every 5 min to keep partner APIs happy.
Guarantees “exactly-once” enqueue, with zero data loss on crashes.
Gives granular, always-on alerting and a lean storage footprint.
Concurrency-safe under many replicas.
Observability built on metrics and alerts rather than logs.
| No | Must / Should | Requirement |
| --- | --- | --- |
| 1 | MUST | Poll every active shipment until Delivered or 60 days. |
| 2 | MUST | Allow per-account polling_interval_minutes, default 120. |
| 3 | MUST | Slice each interval into 5-min batches (24 slices per 2 h). |
| 4 | MUST | Exactly-once enqueue to Kafka; idempotent consumers. |
| 5 | MUST | Crash-safe; recover by replaying an outbox. |
| 6 | SHOULD | Alert at account level when lag > 110% of interval. |
Selector Worker (Ingress Control)
Runs every minute.
Selects shipments that are due for polling (next_poll_at <= NOW()) from the shipment_polling_tracker table.
Inserts a simple record for each due shipment into the scheduler_outbox table.
Commits and exits quickly to keep database locks short and efficient.
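A minimal sketch of one selector transaction, assuming Postgres; shipment_id and the exact outbox columns are illustrative, since the article names only the two tables and next_poll_at:

BEGIN;

-- Queue every due shipment into the outbox in one short transaction.
INSERT INTO scheduler_outbox (shipment_id)
SELECT shipment_id
FROM shipment_polling_tracker
WHERE next_poll_at <= NOW();

COMMIT;

A production version would also mark the picked rows (the article delegates re-arming next_poll_at to the updater workers) so the next one-minute run does not reselect them.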
Unscheduled Shipments Updater Worker (Egress Control)
Continuously checks the shipment_polling_tracker table for new shipments to be polled (freshly registered AWBs).
Fetches new shipments (with next_poll_at as NULL), assigns a randomized next_poll_at within the account’s polling interval to spread out load.
Updates next_poll_at in shipment_polling_tracker using the polling interval from polling_schedule_config.
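A sketch of that randomized assignment, assuming Postgres; account_id as the join key and the config column layout are assumptions beyond the table and column names the article gives:

-- Give each fresh shipment (next_poll_at IS NULL) a random first poll
-- time inside its account's interval, so new load spreads out evenly.
UPDATE shipment_polling_tracker t
SET next_poll_at = NOW()
    + random() * make_interval(mins => c.polling_interval_minutes)
FROM polling_schedule_config c
WHERE t.account_id = c.account_id
  AND t.next_poll_at IS NULL;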
Scheduled Shipments Updater Worker (Egress Control)
Periodically reviews all shipments in the shipment_polling_tracker table.
For polled shipments, recalculates and updates next_poll_at based on the last poll time and the account’s interval, adding a small random jitter to avoid future spikes.
Ensures that polling times are evenly distributed, preventing traffic bursts and keeping the system smooth.
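A sketch of the re-arming step under the same assumptions; last_polled_at and the five-minute jitter bound are illustrative (the article says only “a small random jitter”):

-- Re-arm already-polled shipments: last poll time plus the account's
-- interval, plus random jitter to avoid synchronized future spikes.
UPDATE shipment_polling_tracker t
SET next_poll_at = t.last_polled_at
    + make_interval(mins => c.polling_interval_minutes)
    + random() * INTERVAL '5 minutes'
FROM polling_schedule_config c
WHERE t.account_id = c.account_id
  AND t.last_polled_at IS NOT NULL
  AND t.next_poll_at <= NOW();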
Kafka Shipment Tracking
Receives shipment polling events from the Flusher Worker, which drains the scheduler_outbox in batches.
Guarantees exactly-once delivery to all downstream consumers, even in the event of retries or failures.
Courier Consumer Pool
Continuously polls Kafka for new shipment events.
Makes HTTP requests to courier partner APIs to fetch the latest shipment status.
Writes status updates and scan events to the next queue in the pipeline for further processing.
Exactly-Once Messaging.
The Flusher uses Kafka’s idempotent-producer protocol keyed on shipment_id, which means a resend after a crash is automatically de-duplicated at the broker level. Downstream consumers therefore see each shipment message once, regardless of partial failures.
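A sketch of the outbox flush that pairs with this, assuming an auto-increment id on scheduler_outbox (the batch size and the :last_acked_id parameter are illustrative):

-- Inside the Flusher's job lease: read the next batch from the outbox.
SELECT id, shipment_id
FROM scheduler_outbox
ORDER BY id
LIMIT 1000;

-- Publish the batch to Kafka keyed on shipment_id, then delete the rows
-- only after the broker has acknowledged every message.
DELETE FROM scheduler_outbox
WHERE id <= :last_acked_id;

If the Flusher crashes between publish and delete, the rows stay in the outbox and are re-published on the next run; the idempotent producer and idempotent consumers absorb the duplicates.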
Load Smoothing.
By slicing every account’s interval into 24 equal five-minute windows, the system spreads API calls evenly instead of issuing two-hour bursts. Courier-facing QPS is flat, staying within partner rate limits.
Horizontal Scalability.
Both Selector and Flusher are stateless. Additional replicas can be deployed to absorb higher throughput simply by increasing Nomad counts; the job-lease mechanism described below prevents duplicate processing.
Latency Budget.
Control-plane transactions target sub-200 ms per five-minute slice.
Data-plane latency is bounded only by courier round-trip times; the polling-freshness SLO is 1.1 × the configured interval for 99% of shipments.
Eliminates courier rate-limit spikes.
Provides per-account configurability without schema changes to downstream consumers.
Reduces mean-time-to-detect (MTTD) for polling issues from 2 hours to < 10 minutes.
Instead of relying on row-level locks like FOR UPDATE SKIP LOCKED, we use a JobConfig table as a distributed lease manager.
CREATE TABLE job_config (
    job_name          TEXT PRIMARY KEY,
    slice_start       TIMESTAMP,              -- start of the slice this job covers
    slice_end         TIMESTAMP,              -- end of the slice this job covers
    last_started_at   TIMESTAMP,
    last_completed_at TIMESTAMP,
    running           BOOLEAN DEFAULT false,  -- true while a replica holds the lease
    version           BIGINT                  -- optimistic-lock counter for the claim
);
Each worker replica tries to claim a job slice:
UPDATE job_config
SET running = true,
last_started_at = NOW(),
version = version + 1
WHERE job_name = :job
AND running = false
AND version = :current_version;
If update count = 1 → lease acquired.
If update count = 0 → another worker already owns it.
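One plausible release step (not shown in the article; it reuses the job_config columns above and assumes slices advance by five minutes):

-- Release the lease and advance the job to its next five-minute slice.
UPDATE job_config
SET running = false,
    last_completed_at = NOW(),
    slice_start = slice_end,
    slice_end = slice_end + INTERVAL '5 minutes'
WHERE job_name = :job;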
Selector Jobs
Only one replica selects shipments per slice.
Prevents double enqueueing of the same shipments.
Flusher Jobs
Only one replica flushes a batch from the outbox.
Guarantees exactly-once enqueue into Kafka.
Updater Jobs
Ensures only one rescheduler run per slice.
Prevents next_poll_at overwrite collisions.
If a job is missed (e.g., worker crash), the next worker checks for slice_end < NOW() AND running=false.
Missed slices are replayed in order until the schedule is caught up.
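A sketch of that catch-up check against job_config (column names as above):

-- Find jobs whose slice window closed without a completed run.
SELECT job_name, slice_end, version
FROM job_config
WHERE slice_end < NOW()
  AND running = false;

Each stale job is then claimed with the same compare-and-set UPDATE shown earlier, and its missed slices are replayed in order.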
In large-scale systems like ours, shipments move through a complex network of jobs, queues, and schedulers. To keep everything running smoothly, we’ve built two streams of alerts—one for the producer side (the scheduler that puts work into queues) and one for the consumer side (the workers that pull messages out of queues).
Think of it like a factory:
Producer alerts make sure the conveyor belt is loaded on time.
Consumer alerts make sure the workers on the line are keeping up.
The scheduler’s job is to decide when each shipment should be checked again and assign it a next_poll_at time. If the scheduler falls behind, shipments don’t get scheduled on time, and downstream processes suffer.
We track two key signals:
What it means: Every 5 minutes, the scheduler should assign a next_poll_at to shipments.
How we check: At any given moment (T0), we look back 10 minutes; if shipments from that window still don’t have a schedule (next_poll_at is missing), the scheduler is lagging.
Why it matters: If the scheduler slows down, the whole system loses its rhythm.
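A sketch of that lag probe, assuming a created_at column on the tracker (the article names only next_poll_at):

-- Shipments registered more than 10 minutes ago that still have no
-- scheduled poll time: a non-zero count means the scheduler is lagging.
SELECT COUNT(*)
FROM shipment_polling_tracker
WHERE next_poll_at IS NULL
  AND created_at < NOW() - INTERVAL '10 minutes';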
What it means: A shipment is polled now, but it wasn’t polled in the last cycle (a 2-hour window).
Why it matters: This highlights fairness issues—some shipments might get skipped because of ceiling limits, slice caps, or race conditions. Spotting this helps us prevent silent skips in polling.
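One way to surface such gaps, assuming a last_polled_at column; this flags any shipment whose time since its last poll exceeds a full account interval:

-- Shipments that have silently missed at least one polling cycle.
SELECT t.shipment_id
FROM shipment_polling_tracker t
JOIN polling_schedule_config c ON c.account_id = t.account_id
WHERE t.last_polled_at < NOW()
      - make_interval(mins => c.polling_interval_minutes);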
Even if the scheduler is doing its job, things can still go wrong on the consumption side.
What we track: Whether messages start piling up in the queue.
Why it matters: If consumers (the workers pulling shipments for processing) slow down or stall, backlogs build up quickly. That’s our early warning signal to scale workers or investigate bottlenecks.
Queues are the heartbeat of our tracking system. A delay on the producer side means shipments don’t even enter the pipeline. A delay on the consumer side means shipments are stuck waiting.
By monitoring both ends, we make sure:
No shipment is left unscheduled.
No shipment gets unfairly skipped.
No queue grows silently in the background.
This two-pronged alerting system helps us catch issues early and keep delivery experiences smooth for our clients.