Continuously increasing drift in scheduled workflow, more than 4 hours now #196910

lelegard · 2026-05-25T13:39:56Z

lelegard
May 25, 2026

🏷️ Discussion Type

Question

💬 Feature/Topic Area

Schedule & Cron Jobs

Discussion Details

Hi,

I have observed a regular drift in scheduled workflows on GitHub runners. I know that the scheduled time is only the earliest possible start and that the workflow starts some time after it.

However, in the past few months, the drift has been increasing, with an average delay of more than 4 hours now. In 2025, the average delay was 1h 40mn, already quite a lot, but stable. Since then, the delay has been steadily increasing, reaching 4h 30mn now.

I have a "nightly build" workflow which is scheduled at 00:40 UTC (to avoid peak traffic at 00:00):

on:
  schedule:
  - cron: '40 0 * * *'

Here are some actual starting dates and times, UTC, in the last year:

2026-05-25 05:08:22
2026-05-01 04:37:13
2026-04-01 03:52:31
2026-03-01 03:32:43
2026-02-01 03:39:12
2026-01-01 02:57:05
2025-12-01 02:57:02
2025-11-01 02:22:19
2025-10-01 02:21:29
2025-09-01 02:31:11
2025-08-01 03:00:41
2025-07-01 02:39:34
2025-06-01 02:50:41
2025-05-01 02:27:41
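
To put numbers on these samples, here is a minimal Python sketch that subtracts the 00:40 UTC slot from a few of the start times listed above. It assumes each run started on the same UTC day as its slot, which holds for the values shown; the names `SLOT` and `delay_after_slot` are illustrative only.

```python
from datetime import datetime, time, timedelta

# Scheduled slot taken from the cron expression '40 0 * * *' (00:40 UTC).
SLOT = time(hour=0, minute=40)

def delay_after_slot(started: str, slot: time = SLOT) -> timedelta:
    """Return how long after the same-day slot a run actually started."""
    start = datetime.strptime(started, "%Y-%m-%d %H:%M:%S")
    expected = datetime.combine(start.date(), slot)
    return start - expected

if __name__ == "__main__":
    for started in ("2025-05-01 02:27:41", "2026-01-01 02:57:05", "2026-05-25 05:08:22"):
        # Prints 1:47:41, 2:17:05 and 4:28:22 respectively.
        print(started, "->", delay_after_slot(started))
```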

The initial job of the workflow runs on an ubuntu-latest GitHub runner. Then, jobs are dispatched on ubuntu-latest and windows-latest. The workflow is https://github.com/tsduck/tsduck/actions/workflows/nightly-build.yml

Any explanation? I am not complaining, because this is free CI for an open source project. However, the instability and the continuously increasing delay are worrying.

hzijad · 2026-05-25T14:02:38Z

hzijad
May 25, 2026

Hey lelegard,

You are definitely not alone in seeing this, and your data perfectly illustrates a massive headache that a lot of open-source maintainers are facing right now.

That 4+ hour delay isn't a problem with your code; it's an infrastructure bottleneck. Even though you wisely avoided the exact 00:00 UTC mark, scheduling at 00:40 UTC puts your workflow right in the path of the global peak traffic wave. Thousands of developers schedule builds in that first hour of the UTC day. Because GitHub's cron queue runs on shared global infrastructure, a massive backlog at midnight causes a compounding ripple effect that pushes later jobs further and further back. As GitHub has grown over the last year, that queue has simply become more congested.

Since you need a reliable nightly build, here are two ways to fix this:

1. Shift out of the midnight window entirely
The hours between 22:00 UTC and 02:00 UTC are the most congested. If you move your nightly build to a completely off-peak global time, like 14:13 UTC or 18:37 UTC, it will bypass the backlog entirely. Using "odd" minutes instead of 10-minute increments also helps you slide into quieter windows in the scheduler.

2. Use an external trigger
If you absolutely need the build to run around 00:40 UTC, you can't rely on GitHub's native cron. The standard community workaround is to keep workflow_dispatch active in your YAML, and use a free external service (like Google Cloud Scheduler, AWS EventBridge, or a webhook tool) to ping GitHub’s API at exactly 00:40. Because API-driven events bypass the sluggish cron queue, your runner will spin up almost instantly.

It's awesome that you kept such a detailed log of your build times. It really highlights how platform scale affects open-source infrastructure. Good luck tracking down a quieter slot for the build!

0 replies

kailashv2 · 2026-05-25T14:05:48Z

kailashv2
May 25, 2026

Your data suggests something more interesting than normal scheduler jitter. The trend appears progressively increasing rather than randomly distributed, which makes me hesitant to attribute it solely to expected queue latency.

A few observations:

You've already avoided the common 00:00 UTC contention window by scheduling at 00:40 UTC
The drift appears to grow over time (~1h40m → ~4h30m), which looks more like systemic pressure than isolated scheduling noise
The pattern is relatively smooth across months instead of oscillating heavily

I would separate the problem into two distinct phases:

1. Scheduler latency
(Time between cron trigger and workflow creation)

2. Runner acquisition latency
(Time between workflow creation and actual job execution)

Right now those two are blended together.

I would inspect:

Scheduled execution timestamp
Workflow queued timestamp
Job start timestamp

If workflow creation itself is delayed by several hours, that points toward scheduling backlog. If workflow creation happens close to 00:40 UTC but jobs remain queued for hours, then runner availability becomes a more likely explanation.

I would also run a controlled experiment:

on:
  workflow_dispatch:
  schedule:
    - cron: '40 0 * * *'

Manually trigger the identical workflow near the same time window and compare queue behavior.

Another useful signal would be moving the scheduled time to a deliberately unusual slot (for example 11:17 UTC or 16:23 UTC) for several days and measuring the delta.

The most interesting aspect here is not the existence of delay — scheduled workflows have always had some elasticity — but the apparent monotonic increase over time. That trend suggests either changing infrastructure characteristics or changing load patterns rather than ordinary scheduling variance.

0 replies

johan-lindqvist · 2026-05-25T14:38:04Z

johan-lindqvist
May 25, 2026

We're seeing issues where our workflows that run every 15 minutes are only running every 90 minutes, starting today. Seems like something changed around midnight UTC today where they started getting slower and slower (30 minutes to start, now up to 90 minutes).

Our cron trigger:

cron: '0/15 * * * *'

0 replies

lelegard · 2026-05-25T15:12:31Z

lelegard
May 25, 2026
Author

Thanks @hzijad, @kailashv2, @johan-lindqvist for your interest.

My data were extracted by a Python script which uses the GitHub API to collect information on all runs of that specific workflow. I posted only one line per month but I got a log of all executions, numbered 1939 to 2369.
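
The script itself is not shown in this thread. As a rough sketch of what such a query could look like, the following assumes the REST endpoint that lists the runs of one workflow and the `run_number`, `created_at`, `run_started_at` and `conclusion` fields of its response. The helpers `runs_url` and `summarize_runs` are hypothetical names, and the sample payload is a hand-written fragment filled with the values of run 2369 reported later in this thread.

```python
import json
from urllib.parse import urlencode

API = "https://api.github.com"

def runs_url(owner: str, repo: str, workflow: str, page: int = 1) -> str:
    """Build the URL listing the scheduled runs of one workflow file."""
    query = urlencode({"event": "schedule", "per_page": 100, "page": page})
    return f"{API}/repos/{owner}/{repo}/actions/workflows/{workflow}/runs?{query}"

def summarize_runs(payload: str) -> list[tuple[int, str, str, str]]:
    """Extract (run number, created, started, conclusion) from a JSON response body."""
    runs = json.loads(payload)["workflow_runs"]
    return [
        (r["run_number"], r["created_at"], r["run_started_at"], r["conclusion"])
        for r in runs
    ]

if __name__ == "__main__":
    # Hand-written fragment shaped like the API response, for illustration only.
    sample = json.dumps({"workflow_runs": [{
        "run_number": 2369,
        "created_at": "2026-05-25T05:08:22Z",
        "run_started_at": "2026-05-25T05:08:22Z",
        "conclusion": "failure",
        "event": "schedule",
    }]})
    print(runs_url("tsduck", "tsduck", "nightly-build.yml"))
    print(summarize_runs(sample))
```

Fetching the URL (with pagination and, if needed, an access token) is left out so that the sketch has no network dependency.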

Data are retained down to April 2025. The oldest run is numbered 1939, meaning that information from the previous 1938 runs of that workflow was erased (I can understand that). So, we have a view over one year only. If my memory serves me right, initial runs were reasonably aligned on the scheduled time, years ago.

I will try to collect more information and move the schedule to some exotic time of day. It is weird that people target "night time" UTC instead of "night time" in their own time zone. Being located in Europe, UTC is almost my time and it makes sense to use a time around midnight UTC. But I don't understand why American or Asian developers schedule their jobs in the middle of their working days.

0 replies

lelegard · 2026-05-25T15:54:19Z

lelegard
May 25, 2026
Author

@kailashv2

"I would inspect:

Scheduled execution timestamp
Workflow queued timestamp
Job start timestamp"

How would you get the first two from the GitHub API? I can only get "created" and "started" timestamps and they are always identical. I don't know if "created" means "enqueued". Concerning the planned scheduled date, I can only assume what's in the yml file.

  Run#   Created               Started               Duration   Status     Origin
------   -------------------   -------------------   --------   --------   ---------
  2369   2026-05-25 05:08:22   2026-05-25 05:08:22   00:59:21   failure    schedule
  2368   2026-05-24 04:49:20   2026-05-24 04:49:20   01:50:03   success    schedule
  2367   2026-05-23 04:27:01   2026-05-23 04:27:01   00:00:10   success    schedule
  2366   2026-05-22 04:46:20   2026-05-22 04:46:20   00:00:13   success    schedule
  2365   2026-05-21 04:55:25   2026-05-21 04:55:25   01:51:24   success    schedule
  2364   2026-05-20 04:49:00   2026-05-20 04:49:00   01:56:44   success    schedule

I have updated the nightly build schedule to 14:37 UTC (not so nightly now). We'll see if it affects the delay.

I cannot update that too often because each push results in 2 hours of CI/CD execution.

0 replies

hzijad · 2026-05-25T16:50:19Z

hzijad
May 25, 2026

Hey lelegard,

You're completely right, and you aren't missing anything in your Python script. The GitHub API fundamentally collapses this telemetry, making it impossible to see the "hidden" queue time.

To answer your question directly: you cannot pull a separate "scheduled execution" or "queued" timestamp from the API.

For scheduled crons, GitHub doesn't create the workflow object until it is actually ready to execute it. The created_at timestamp is generated the exact millisecond the workflow is instantiated in their database, and because a runner is assigned immediately after, created_at and started_at will always look identical.

The hours of drift you are seeing happen before the API even knows the run exists. GitHub’s internal cron engine is holding onto your trigger, delayed by the global backlog, before it finally spawns the workflow object.

Your point about developers targeting midnight UTC is spot on, and it explains why the drift is compounding. Most people don't map crons to their local time zones; they just copy-paste boilerplate templates from documentation, which almost always default to 0 0 * * *. On top of that, enterprise tools and automated dependency bots (like Dependabot) are hardcoded to trigger at the turn of the UTC day. You've been fighting a massive wave of default settings.

Moving your schedule to 14:37 UTC is the perfect fix. It gets you completely out of that midnight bottleneck and avoids round numbers. Don't worry about pushing any more changes; just let this one ride for a few days, and you should see those Created timestamps finally align perfectly with your scheduled time!

0 replies

doncjohn · 2026-05-25T17:34:04Z

doncjohn
May 25, 2026

For the API question in your follow-up: I do not think you are missing a separate timestamp. For scheduled workflows, the public workflow run object is created only when GitHub actually materializes the scheduled run. That means created_at and run_started_at can be identical even when the cron event was delayed for hours before the run object existed.

So, from the public API, you generally have:

expected schedule time: compute this yourself from the cron expression
workflow run creation time: created_at
job start time: from the jobs API, e.g. started_at

The hidden part is the time between the cron expression's expected fire time and GitHub creating the workflow run. That is the drift you are measuring, but GitHub does not expose it as a first-class scheduled_at or queued_at field for cron events.

For measurement, I would compute it explicitly in your script:

drift = workflow_run.created_at - expected_cron_time_for_that_day
runner_wait = first_job.started_at - workflow_run.created_at

If created_at and first job started_at are nearly equal, the delay is before workflow creation, not runner acquisition.
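
The two differences above can be sketched in Python as follows. This is only an illustration under stated assumptions: timestamps use the `YYYY-MM-DDTHH:MM:SSZ` form, the cron schedule fires once a day, and the delay is shorter than 24 hours, so the expected slot is the most recent one at or before `created_at`. The function names are hypothetical, and the job start time in the usage example is invented for the demonstration.

```python
from datetime import datetime, timedelta, timezone

def parse_ts(value: str) -> datetime:
    """Parse a timestamp such as '2026-05-25T05:08:22Z' as an aware UTC datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def expected_slot(created: datetime, hour: int, minute: int) -> datetime:
    """Most recent daily slot (hour:minute UTC) at or before the creation time."""
    slot = created.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if slot > created:
        # The run was created after midnight, so its slot was on the previous day.
        slot -= timedelta(days=1)
    return slot

def timing(created_at: str, job_started_at: str, hour: int, minute: int):
    """Return (drift, runner_wait) for one run of a once-a-day cron schedule."""
    created = parse_ts(created_at)
    drift = created - expected_slot(created, hour, minute)
    runner_wait = parse_ts(job_started_at) - created
    return drift, runner_wait

if __name__ == "__main__":
    # created_at is from the thread; the job start time is a made-up value.
    drift, runner_wait = timing("2026-05-25T05:08:22Z", "2026-05-25T05:08:30Z", 0, 40)
    print("drift:", drift, "runner wait:", runner_wait)
```

Looking backwards for the slot matters for late-evening schedules, where a delayed run is created on the following UTC day.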

A useful in-workflow breadcrumb is to print both the actual UTC time and the intended slot at the top of the first job:

- name: Record schedule timing
  run: |
    date -u +'%Y-%m-%dT%H:%M:%SZ'
    echo "Expected cron slot: 00:40 UTC"

That will not reveal GitHub's internal queued time, but it makes future logs easier to audit.

Moving to 14:37 UTC is a good experiment. If the drift collapses, it strongly suggests cron backlog before run creation. If it stays large, then I would start suspecting a broader scheduler issue and open a support ticket with the run IDs and your computed drift table.

0 replies

kailashv2 · 2026-05-25T22:16:05Z

kailashv2
May 25, 2026

The additional data you collected is actually very informative. The interesting signal here is that created_at and started_at are effectively identical, which strongly suggests the runner itself is not the bottleneck.

That shifts suspicion toward the period before the workflow run object even exists.

A few observations:

Your measured drift has a fairly smooth upward trend over many months rather than noisy day-to-day variation.
The delay appears to happen before workflow materialization (created_at ≈ started_at).
Other users in this thread are now reporting similar behavior with high-frequency schedules (15-minute workflows running every ~90 minutes), which points more toward shared scheduler pressure than repository-specific behavior.

One thing I would be careful about though: I would avoid assuming that 00:40 UTC alone explains a consistent increase from ~1h40m to ~4h30m. Midnight contention can explain large delays, but a monotonic increase over time feels more like changing platform load characteristics than ordinary queue variance.

Your experiment with 14:37 UTC is probably the most valuable next step because it isolates the hypothesis cleanly.

If I were measuring this, I would compute:

expected_time = scheduled_cron_time
scheduler_drift = created_at - expected_time
runner_delay = started_at - created_at

From the data shown so far:

scheduler_drift ≈ several hours
runner_delay ≈ 0

If moving to 14:37 UTC suddenly collapses the drift from hours to minutes, that strongly supports scheduler backlog before workflow creation.

If it remains in the multi-hour range even outside common scheduling windows, I would start suspecting either:

a broader GitHub scheduling issue
prioritization differences affecting free/open-source workloads
internal queue behavior changing over time

The nice thing is that your change only requires patience now. You already set up what is essentially a controlled experiment.

0 replies

lelegard · 2026-05-26T18:35:48Z

lelegard
May 26, 2026
Author

The first run after moving the schedule to 14:37 started at ... 17:46:12.
Still delayed by more than 3 hours.

So, the midnight traffic jam was not the culprit. Do you use scheduled jobs? Do you experience such delays?

Is it possible that GitHub prioritises scheduled jobs by average duration of previous runs? This job runs for 1h 50mn when there is something to build and only a few seconds if the project was not modified in the last 24 hours.

0 replies

lelegard · 2026-06-02T13:01:57Z

lelegard
Jun 2, 2026
Author

Some tests from other times of the day:

00:40 UTC (initial schedule): delayed by 1h30 to 4h30, increasing over a year.
14:37 UTC: delayed by 3h.
23:11 UTC: delayed by 1h to 2h.

The time of day definitely has an influence, but the delay is still significant at all times.

I have opened a support issue here: actions/runner#4468

0 replies

nebuk89 · 2026-06-04T08:58:06Z

nebuk89
Jun 4, 2026

Hey y'all, sorry I missed you raising this.
We are aware that the drift on the start of our scheduled jobs has got worse; we will be working on this as we go through our current availability improvement work but (there is always a 'but' 😞 ) this isn't a fix 'now'. This drift is part of us balancing load coming in as scheduled drops have grown >30% in 2ish months. We need to get everything properly stable before we increase the throughput further from scheduled jobs.

Keep poking me here, I am here and reading/listening. We are working on this as part of the wider work <3

3 replies

johan-lindqvist Jun 4, 2026

I see a lot of people doing custom solutions for triggering workflows on a schedule which then uses workflow_dispatch to run the workflow. But I assume the more people do this it will further increase the delays in the native scheduling due to the increased load on the runners which is a really bad situation.

I don't want to use a custom solution to trigger using workflow_dispatch but I also am dependent on scheduled workflows that are working so there's not really much of a choice at this point. I really hope you resolve this issue quickly.

lelegard Jun 4, 2026
Author

@nebuk89

"This drift is part of us balancing load coming in as scheduled drops have grown >30% in 2ish months"

My experience is more 150 to 200% than 30%. Delayed by 4 hours now...

Are there any criteria to prioritize scheduled jobs, apart from paid/free account? My "nightly build" job takes either 1h 40mn (when the repo was updated) or 15 seconds (when it was not). Does it make a difference?

Note that I don't complain, I don't really care about the time the build is done and published, as long as it is done every day. I just try to understand the reasons.

nebuk89 Jun 5, 2026

It is honestly load at spikes, we have some limited prioritisation work but honestly 'everyone' is seeing unacceptable delays IMO (some are seeing less but 4 hours or 2 hours or even 'only 1 hour' isn't even close to good enough).

So it is just load, and we are nearing a point where that is constant so I cannot provide a good 'oh if you schedule at this time it will be much better'.

It sucks -> the experience needs fixing.

I am going to have one of my team add a roadmap item for this in the next week and start getting clearer dates on when this will get better.

techlinn · 2026-06-04T15:57:05Z

techlinn
Jun 4, 2026

@hzijad had the right idea with external triggers, and @johan-lindqvist's concern about a vicious cycle is fair, but I don't think it applies here lol.

workflow_dispatch and the cron scheduler are separate queues. The drift you're seeing lives entirely in the cron scheduler. When you trigger via workflow_dispatch, you skip that queue completely and the run starts within seconds. You're not adding load to the thing that's broken.

Use cron-job.org to POST to the dispatch endpoint on whatever schedule you need.
https://api.github.com/repos/{owner}/{repo}/actions/workflows/{workflow_id}/dispatches

Headers:

Authorization: Bearer YOUR_PAT
Accept: application/vnd.github+json

Body:
{ "ref": "main" }

A fine-grained PAT with Actions read/write on the repo is all you need. If you want it more auditable, a GitHub App with actions: write is cleaner, but for a personal project the PAT is fine in my opinion.
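
For anyone who prefers a script to a hosted cron service, the request described above could be assembled with the Python standard library roughly like this. It is a sketch, not a recommended setup: `build_dispatch_request` is a hypothetical helper, `OWNER`, `REPO` and `YOUR_PAT` are placeholders, and the target workflow is assumed to declare the `workflow_dispatch` trigger.

```python
import json
import urllib.request

def build_dispatch_request(owner: str, repo: str, workflow_id: str,
                           token: str, ref: str = "main") -> urllib.request.Request:
    """Prepare (but do not send) the POST request for a workflow_dispatch event."""
    url = (f"https://api.github.com/repos/{owner}/{repo}"
           f"/actions/workflows/{workflow_id}/dispatches")
    body = json.dumps({"ref": ref}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )

if __name__ == "__main__":
    req = build_dispatch_request("OWNER", "REPO", "nightly-build.yml", "YOUR_PAT")
    print(req.get_method(), req.full_url)
    # Sending is left commented out so that the sketch has no side effects:
    # urllib.request.urlopen(req)
```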

On the account type question: scheduled jobs don't get prioritized by plan. It's queue position. The only way around it is dedicated runners, which enterprise orgs can set up, but that's not an option for most people here.

Hope this answers the question!

4 replies

johan-lindqvist Jun 5, 2026

Are you sure the load problem is exclusively the scheduler? If so, where do you get that information from? Scheduling seems like something that would not be that difficult to scale?

I assumed (maybe incorrectly) that the scaling issue is with the amount of runners and not the actual scheduling service. If that assumption is correct, then my reasoning is that people using workflow_dispatch skip the queue, which puts more load on the runners, which in turn affects the scheduling if it looks at the runner load and delays the workflows when that load is too high.

lelegard Jun 5, 2026
Author

I agree with @johan-lindqvist. We can safely assume that the scheduler itself consumes far fewer resources than the jobs (otherwise, it would be a very bad design, with an overhead bigger than the payload).

Therefore, if we ignore possible bugs or design flaws in the scheduler, I suppose that the scheduler tries to fairly distribute heavy demands over undersized resources. Bypassing the GitHub scheduler and triggering immediate execution from an external scheduler is a rogue behaviour. It is resource hijacking.

All schedulers work the same way. They assume that there is a larger number of scheduled executions than immediate executions because the latter require human attention (which is scarce) while the former are configured once and for all. Additionally, when a human explicitly triggers a job, they need the result now, not in 4 hours, while a daily job does not always need a precise time as long as it is executed every day. Therefore, it is legitimate that immediate triggers get a higher priority than scheduled jobs, in all environments (the Linux or Windows kernels treat processes the same way).

So, one possible explanation for my problem is that too many people are already using that rogue behaviour of using an external scheduler to trigger immediate jobs. Instead of fairly queuing with others, those "scheduled" jobs bypass the GitHub scheduler and they infinitely delay well-behaved scheduled jobs.

So, @techlinn, your suggestion could be a selfish solution to my problem but it may also be the cause of the collective problem that all users of well-behaved scheduled jobs have.

techlinn Jun 5, 2026

Hmm, I'm not totally certain either.

What I can point to is @nebuk89's exact wording. They said "scheduled drops have grown >30% in 2ish months." A drop is a batch of jobs being dispatched by the scheduler. If this were purely a runner capacity problem I'd expect the language to be about runner demand or queue depth, not drops. That phrasing points at the scheduling dispatch layer itself struggling with volume, not just the pool being undersized.

That said, GitHub does give workflow_dispatch higher priority than scheduled jobs by design, so yeah, triggering via an external cron does mean your job jumps ahead of other scheduled ones in the runner queue. That part is real. Whether you call it "resource hijacking" or just using the system the way GitHub built it is kind of a philosophical question since their own support recommends this workaround, but the tradeoff you're describing is legitimate. It moves the problem off your plate onto someone else's.

If collective fairness actually matters here, self-hosted dedicated runners are the only option that doesn't touch anyone else's queue.

nebuk89 Jun 5, 2026

A minor point of clarity, the queue and delay here 'doesn't care' if you are hosted or self-hosted. The system doesn't have that info at this point (currently, one of the improvements we need to make) and as such the throttle is global and naive :-/

That said, opening the flood gates for self-hosted isn't quite a 'just do it' as suddenly ramping up a ton of load on GitHub is probably a great way to shoot ourselves in the foot.

As with my other comment, I will get a roadmap item up and get a plan back here.

I am actually heading out on a personal leave next week, but will have my team work on a plan while I am out and update here once I am back

Luke-Corwin · 2026-06-05T14:27:12Z

Luke-Corwin
Jun 5, 2026

What you’re seeing is most likely scheduler queueing rather than runner queueing.

A common misconception is that the cron trigger fires exactly on schedule and then waits for a runner. In reality, GitHub’s scheduled workflows are best-effort and may be delayed before the workflow run is even created.

A few observations from your data:

The drift appears relatively smooth and cumulative over time.
The delay is measured in hours rather than minutes.
The workflow is scheduled once per day, so there is no overlap with previous runs.
The actual runner execution time is unlikely to explain the pattern.

This suggests the bottleneck is probably in GitHub’s scheduling infrastructure rather than runner availability.

Some possibilities:

Scheduled workflows for public repositories may be deprioritized during periods of high platform load.
GitHub’s scheduler may internally shard cron jobs, and some shards could be experiencing increased backlog.
Repositories with low activity sometimes appear to receive less aggressive scheduling than repositories with frequent pushes, although GitHub has never publicly documented such behavior.

One thing worth checking is the distinction between:

workflow creation time
first job start time

Using the Actions API, you may be able to determine whether the delay occurs before the workflow run is created or after the run enters the queue.

If the run itself is not created until several hours after the scheduled cron time, that would strongly indicate scheduler-side backlog.

If the run is created on time but the first job starts hours later, that would point more toward runner capacity.

Given the steady increase from roughly 2 hours in mid-2025 to over 4 hours recently, my guess would be platform-level scheduling backlog rather than anything specific to your workflow definition.

I’d be interested to know whether other maintainers of large open-source projects are seeing similar trends in scheduled workflow latency during the same period.

0 replies

lelegard · 2026-06-05T15:37:57Z

lelegard
Jun 5, 2026
Author

@Luke-Corwin

"The actual runner execution time is unlikely to explain the pattern."

Maybe, maybe not. It would make sense to assign a slightly lower priority to recurrent jobs which regularly use a lot of resources. My job has a consistent delay. Other users report consistent delays as well, but not as long as mine. So, there must be something attached to each job. Since mine runs for 1h40 most of the time and just a few seconds the rest of the time, that could be a factor, since 1h40 is maybe a lot for a free public repo.

"Repositories with low activity sometimes appear to receive less aggressive scheduling than repositories with frequent pushes, although GitHub has never publicly documented such behavior."

My repo has pushes almost every day. I wouldn't call this low activity.

"Using the Actions API, you may be able to determine whether the delay occurs before the workflow run is created or after the run enters the queue."

No, we can't; that was explained in earlier posts in this thread. The only dates you can get from the API are "created" and "started", which are exactly the same all the time. If they referred to distinct events, there would sometimes be a difference of at least a few seconds.

1 reply

Luke-Corwin Jun 5, 2026

I agree that the available timestamps make it difficult to separate queue time from workflow creation time. If created_at and run_started_at are always identical, then the API isn’t exposing enough information to determine where the delay is actually occurring.

That said, the existence of a consistent delay doesn’t necessarily imply that GitHub is attaching a fixed penalty to a particular workflow. It could also be the result of internal scheduling factors that aren’t exposed publicly, such as runner availability, repository-level allocation, workflow characteristics, or account-wide resource management.

The interesting part is that multiple users appear to observe repeatable delays rather than purely random queue times. If that pattern is real, then there is likely some stable attribute influencing scheduling decisions. The challenge is that GitHub has never documented what those attributes are, so we’re left inferring behavior from observations rather than from any official explanation.

In your case, the workflow’s runtime is highly bimodal: it usually runs for around 1h40m, but occasionally completes in seconds. If scheduler decisions take historical resource consumption into account, that would be one possible explanation worth considering, although I agree there is currently no evidence proving that is how GitHub’s scheduler works.
