Vary sampling rate in image metadata cache, and back off on HTTP 429 #1354

squaremo · 2018-09-12T15:03:09Z

Vary the sampling rate

To drive the refresh of cache data, we expire entries after a
configurable duration (default of 1 hour).

Because many image tags never change what they point to, this means
the image cache does thousands of needless fetches every hour (or
whatever the configured duration is). On the other hand, if you set it
to much longer (say 24h), any changes to a tag won't be noticed for
that much longer.

This commit gives each manifest its own, adaptive schedule for being
refreshed. It does this by doubling the period when a freshly fetched
manifest does not differ from the existing entry, and halving it when
it does differ. (The refresh period is clipped to [minRefresh, maxRefresh])

A tag that doesn't change will end up being polled infrequently. A tag
that changes occasionally will alternate between a number of values,
depending on how regular the changes are.

There are surely more accurate ways of arriving at a good estimate for
the polling frequency based on the past samples; this one has the
advantage of being very simple, and requiring little state to be kept.

Back off on HTTP 429

Image registries tend to have rate limiting, and it is hard to know in
advance what the practical limits are -- and they will often
change. Nonetheless, we have to back off when we get a HTTP 429,
otherwise the registry could e.g., decide to blacklist our IP.

Therefore:

decrease the rate limit for a registry domain if we get HTTP 429
- we only reliably see this in the round-tripper, but that's where
  the rate limiting is anyway
- this goes through the RateLimiters struct so that we can log it
we also want to edge back towards the configured rate when we have
succeeded
- but we only know if we succeeded in the calling code, so it has
  to be wired through

To drive the refresh of cache data, we expire entries after a configurable duration (default of 1 hour). Because many image tags never change what they point to, this means the image cache does thousands of needless fetches every hour (or whatever the configured duration is). On the other hand, if you set it to much longer (say 24h), any changes to a tag won't be noticed for that much longer. This commit gives each manifest its own, adaptive schedule for being refreshed. It does this by doubling the period when a freshly fetched manifest does not differ from the existing entry, and halving it when it does differ. (The refresh period is clipped to `[minRefresh, maxRefresh]`) A tag that doesn't change will end up being polled infrequently. A tag that changes occasionally will alternate between a number of values, depending on how regular the changes are. There are surely more accurate ways of arriving at a good estimate for the polling frequency based on the past samples; this one has the advantage of being very simple, and requiring little state to be kept. Note that the expiry of the keys is now distinct from the deadline for being refreshed. These were previously conflated, with the refresh "timed" so it would fetch new metadata just before an entry was due to expire. The expiry is set so that there is a generous grace period after the deadline for a new value to be fetched and written.

Image registries tend to have rate limiting, and it is hard to know in advance what the practical limits are -- and they will often change. Nonetheless, we have to back off when we get a HTTP 429, otherwise the registry could e.g., decide to blacklist our IP. Therefore: - decrease the rate limit for a registry domain if we get HTTP 429 - we only reliably see this in the round-tripper, but that's where the rate limiting is anyway - this goes through the RateLimiters struct so that we can log it - we also want to edge back towards the configured rate when we have succeeded - but we only know if we succeeded in the calling code, so it has to be wired through

To see whether the image cache refresh is doing what you expect, it's useful for it to log what it's doing as it does it (and why it does things). This commit adds some trace messages, switched on with the `--registry-trace` flag, and tidies up some of the other logging.

rndstr

lgtm

squaremo force-pushed the rewrite/image-fetch-backoff branch from a4084ae to 23448bb Compare September 12, 2018 15:07

squaremo added 4 commits September 12, 2018 16:17

Remove dead code ContextRoundTripper

7839827

squaremo force-pushed the rewrite/image-fetch-backoff branch from 249a30a to 1c3a901 Compare September 12, 2018 15:17

squaremo requested a review from rndstr September 14, 2018 09:46

rndstr approved these changes Sep 17, 2018

View reviewed changes

squaremo merged commit 2691f5e into master Sep 18, 2018

squaremo deleted the rewrite/image-fetch-backoff branch September 18, 2018 10:49

stefanprodan mentioned this pull request Sep 28, 2018

Flux warmer should handle HTTP 429 responses #1401

Closed

stefanprodan mentioned this pull request Dec 7, 2018

flux hitting Google API rate limits when working with GCR #1016

Closed

squaremo mentioned this pull request Feb 5, 2019

Back-off failed image metadata fetches #910

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vary sampling rate in image metadata cache, and back off on HTTP 429 #1354

Vary sampling rate in image metadata cache, and back off on HTTP 429 #1354

squaremo commented Sep 12, 2018

rndstr left a comment

Vary sampling rate in image metadata cache, and back off on HTTP 429 #1354

Vary sampling rate in image metadata cache, and back off on HTTP 429 #1354

Conversation

squaremo commented Sep 12, 2018

Vary the sampling rate

Back off on HTTP 429

rndstr left a comment

Choose a reason for hiding this comment