issue with small pool sizes #47

Closed
MoritzKeppler opened this issue Oct 21, 2022 · 3 comments · Fixed by #86
Comments

@MoritzKeppler

A minor issue we found during testing with small pool sizes (in our case: min=0, max=2):
If the max pool size is reached, garm will ignore additional jobs queued by GH.
As soon as the currently running jobs are completed, the pool is scaled down to min=0 again, and the jobs waiting in GH will never be scheduled until we manually resend the webhook event or manually trigger the creation of a new runner.

Moritz Keppler moritz.keppler@mercedes-benz.com, Mercedes-Benz Tech Innovation GmbH, legal info/Impressum

@gabriel-samfira
Member

gabriel-samfira commented Oct 21, 2022

Indeed. That is a bit of a pickle. Garm creates new runners based on 2 things:

  • an event from github
  • ensuring min_idle_runners

It will obey the max runner limit, ignoring "queued" events if that limit is reached. This means one decision factor is removed in that circumstance, and min_idle_runners=0 eliminates the other.

An immediate workaround would be to set min_idle_runners to at least 1. A proper solution would be to queue the new runner in the database instead of ignoring the event, and not spin it up until we can accommodate it (i.e. the number of runners falls under max). Unfortunately, that one will take some time to implement due to backlog, but it will be fixed.
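
For illustration, here is a minimal sketch of that "queue instead of ignore" idea. All names are hypothetical; this is not garm's code, and not necessarily how the eventual fix (#86) works:

```go
// Minimal sketch of queuing "queued" events when the pool is full instead of
// dropping them. Hypothetical names only; not garm's actual internals.
package main

import "fmt"

// pendingJob records a "queued" webhook event we could not service immediately.
type pendingJob struct {
	workflowJobID int64
	poolID        string
}

type pool struct {
	maxRunners    int
	activeRunners int
	pending       []pendingJob // stand-in for a database table
}

// handleQueued persists the event when the pool is full instead of dropping it.
func (p *pool) handleQueued(job pendingJob) {
	if p.activeRunners >= p.maxRunners {
		p.pending = append(p.pending, job)
		return
	}
	p.startRunner(job)
}

// runnerCompleted frees a slot and, if jobs are waiting, services the oldest one.
func (p *pool) runnerCompleted() {
	p.activeRunners--
	if len(p.pending) > 0 {
		next := p.pending[0]
		p.pending = p.pending[1:]
		p.startRunner(next)
	}
}

func (p *pool) startRunner(job pendingJob) {
	p.activeRunners++
	fmt.Printf("starting runner in pool %s for job %d\n", job.poolID, job.workflowJobID)
}

func main() {
	p := &pool{maxRunners: 2}
	p.handleQueued(pendingJob{1, "small-pool"})
	p.handleQueued(pendingJob{2, "small-pool"})
	p.handleQueued(pendingJob{3, "small-pool"}) // pool full: queued, not dropped
	p.runnerCompleted()                         // frees a slot and picks up job 3
}
```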

For the time being, I would suggest defining pools at the org/enterprise level (which should reduce the number of pools you need), and keeping at least 1 runner per pool.

@MoritzKeppler
Author

Thanks! Definitely not a critical issue, as we have a workaround.

Queued runners in the database could be a solution to at least guarantee that there will be a sufficient number of runners. But what would happen if the job gets cancelled in the meantime, or executed on another runner, or ...? Maybe it is easier to have an algorithm that just reacts to the current load, e.g. "always start some extra runners that get automatically shut down again if not used within a timeframe"... but that is only a vague idea, not sure if it will work.
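
As a rough illustration of that idle-timeout idea, something like the following; all names and fields are hypothetical, not garm's:

```go
// Illustrative sketch of "extra runners that shut down after an idle timeout";
// hypothetical types only, not garm's actual logic.
package main

import (
	"fmt"
	"time"
)

type runner struct {
	name      string
	busy      bool
	idleSince time.Time
}

// scaleDownIdle drops runners that have been idle longer than idleTimeout,
// while keeping at least minIdle idle runners around.
func scaleDownIdle(runners []runner, minIdle int, idleTimeout time.Duration, now time.Time) []runner {
	kept := runners[:0]
	idleKept := 0
	for _, r := range runners {
		expired := !r.busy && now.Sub(r.idleSince) > idleTimeout
		if expired && idleKept >= minIdle {
			continue // tear this runner down
		}
		if !r.busy {
			idleKept++
		}
		kept = append(kept, r)
	}
	return kept
}

func main() {
	now := time.Now()
	runners := []runner{
		{"r1", false, now.Add(-20 * time.Minute)}, // idle too long, removed
		{"r2", false, now.Add(-2 * time.Minute)},  // still within the timeout
		{"r3", true, now},                         // busy, always kept
	}
	for _, r := range scaleDownIdle(runners, 0, 10*time.Minute, now) {
		fmt.Println("keeping", r.name)
	}
}
```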

@gabriel-samfira
Member

The main issue I see right now is that we ignore events sent via webhooks if max_runners is reached. That is not a good thing to do, and the issue you're seeing is proof of that.

I think a good approach would be to record webhook events as they come in, and use them as a source of truth when making decisions. It would also give us greater transparency into which event triggered the creation of a runner.

We currently receive webhooks for 3 events (a rough handler sketch follows this list):

  • A workflow is queued. This means a workflow has been requested, but no runner has picked it up yet.
  • A workflow transitions to in_progress. This means that one of our runners or some other runner has picked up a workflow in our repo. If it's one of ours, we currently mark the runner as active.
  • A workflow transitions to completed. This means that a workflow has been canceled or has finished running. The outcome of the workflow is not relevant (error or success). If a workflow is canceled before a runner picks it up, the runner_name property is not set.
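
Roughly, dispatching on those three actions could look like this (a simplified sketch, not garm's actual handler; the field names follow GitHub's workflow_job webhook payload):

```go
// Simplified sketch of handling the three workflow_job actions listed above;
// not garm's actual handler code.
package main

import "fmt"

type workflowJobEvent struct {
	Action      string `json:"action"` // "queued", "in_progress" or "completed"
	WorkflowJob struct {
		ID         int64  `json:"id"`
		RunnerName string `json:"runner_name"` // empty if no runner picked the job up
	} `json:"workflow_job"`
}

func handleWorkflowJob(ev workflowJobEvent) {
	switch ev.Action {
	case "queued":
		// A job was requested; no runner has picked it up yet.
		fmt.Println("job queued:", ev.WorkflowJob.ID)
	case "in_progress":
		// Some runner picked the job up; if runner_name is one of ours,
		// mark that runner as active.
		fmt.Println("job running on:", ev.WorkflowJob.RunnerName)
	case "completed":
		// The job finished or was canceled; runner_name stays empty if it
		// was canceled before any runner picked it up.
		fmt.Println("job completed:", ev.WorkflowJob.ID)
	}
}

func main() {
	handleWorkflowJob(workflowJobEvent{Action: "queued"})
}
```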

What we could do is create a new entity in our store that tracks webhook events and their state, then use that to scale our runners. As long as we receive webhooks, we can keep it updated. The challenge is to make this robust enough to keep working through GitHub outages, prolonged garm downtime, or even restarts. We also need to keep in mind that we may have pools and webhooks configured at all 3 levels (repo, org, enterprise), so a single workflow will trigger an event at all 3 levels.
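
One possible shape for such a tracked-event entity, purely as an illustration (the field names are assumptions, not garm's schema and not necessarily what #86 implemented):

```go
// Illustrative shape for a stored webhook-event record; field names are
// assumptions, not garm's actual schema.
package store

import "time"

type WorkflowJobRecord struct {
	JobID      int64     // GitHub workflow_job ID; lets repo/org/enterprise deliveries of the same job be deduplicated
	Status     string    // "queued", "in_progress" or "completed"
	RunnerName string    // set once a runner picks the job up
	PoolID     string    // pool chosen to service the job, if any
	UpdatedAt  time.Time // last webhook delivery that touched this record
}
```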

This is a non-trivial change that will require some thought and time, but will ultimately improve the way garm makes decisions.
