This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Cache warming exceeds GCR rate limiting #778

Closed
jpellizzari opened this issue Oct 5, 2017 · 7 comments

Comments

@jpellizzari

Reported by a customer.

Logs for the weave-flux-agent show that the agent is making way too many requests in a short period of time:
[screenshot, 2017-10-05: weave-flux-agent logs showing a rapid burst of registry requests]

Caller shows as warming.go: https://github.com/weaveworks/flux/blob/master/registry/warming.go

This seems to have started happening after the upgrade to 1.0.1.

@jpellizzari
Author

After bouncing the agent, it spammed a little more, then terminated:
[screenshot, 2017-10-05: agent log after restart, showing a few more errors before terminating]

@writer-jr

This appears to be the result of Google deprecating V1:
textPayload: "ts=2017-10-05T21:14:19Z caller=warming.go:145 component=warmer err="requesting manifests: getting remote manifest: Get https://gcr.io/v2/qordoba-devel/segmentation/manifests/latest: http: non-successful response (status=404 body=\"{\\\"errors\\\":[{\\\"code\\\":\\\"MANIFEST_UNKNOWN\\\",\\\"message\\\":\\\"Manifest with tag 'latest' has media type 'application/vnd.docker.distribution.manifest.v2+json', but client accepts 'application/vnd.docker.distribution.manifest.v1+json'.\\\"}]}\")" "

@squaremo
Member

squaremo commented Oct 5, 2017

Mitigation: tune the daemon arguments --registry-rps and --registry-burst way down (10 and 1). These govern the rate limiting for outgoing image registry requests so it doesn't spam even if it's getting errors.
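For reference, a sketch of how those flags might be passed to fluxd (the invocation below is illustrative; adapt it to however your deployment supplies container args):

```shell
# Illustrative fluxd invocation with registry rate limiting tuned down.
# --registry-rps:   max outgoing registry requests per second
# --registry-burst: max requests allowed in flight at once
fluxd \
  --registry-rps=10 \
  --registry-burst=1
```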

The errors in question are

textPayload: "ts=2017-10-05T21:14:19Z caller=warming.go:145 component=warmer err="requesting manifests: getting remote manifest: Get https://gcr.io/v2/path/to/repo/manifests/latest: http: non-successful response (status=404 body=\"{\\\"errors\\\":[{\\\"code\\\":\\\"MANIFEST_UNKNOWN\\\",\\\"message\\\":\\\"Manifest with tag 'latest' has media type 'application/vnd.docker.distribution.manifest.v2+json', but client accepts 'application/vnd.docker.distribution.manifest.v1+json'.\\\"}]}\")"

That last bit is the clue: GCR appears to no longer be serving v1 schema manifests. We assume v1 manifests are available for all images (which seems to be true elsewhere). On the other hand, Docker Hub doesn't seem to serve schema2 manifests for all images, only relatively recent ones. Sigh. We might have to look at the Content-Type to see which we've been given.
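A minimal sketch of that Content-Type check (the media type strings are the standard Docker distribution ones; the helper name is mine, not anything in flux):

```go
package main

import "fmt"

const (
	schema1MediaType       = "application/vnd.docker.distribution.manifest.v1+json"
	schema1SignedMediaType = "application/vnd.docker.distribution.manifest.v1+prettyjws"
	schema2MediaType       = "application/vnd.docker.distribution.manifest.v2+json"
)

// manifestSchemaVersion inspects the Content-Type returned by the
// registry and reports which manifest schema we were given.
func manifestSchemaVersion(contentType string) (int, error) {
	switch contentType {
	case schema1MediaType, schema1SignedMediaType:
		return 1, nil
	case schema2MediaType:
		return 2, nil
	default:
		return 0, fmt.Errorf("unrecognised manifest media type %q", contentType)
	}
}
```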

https://gist.github.com/squaremo/b3b832d6a3b9e0cc07fb35b854a66d47 has a program I used to see what the schema2 manifests look like, using the same docker registry client we use (https://github.com/heroku/docker-registry-client).

If we can't use schema1 manifests, we can use schema2. It takes an extra step, fetching the Config blob as shown in the gist, but that blob has the information we need. It looks like this:

{
  "architecture": "amd64",
  "author": "Weaveworks Inc <help@weave.works>",
  "config": {
    // ...
  },
  "container": "1aa5661ab22a2b34c0e785d94f24c61e129266dfbfa80b2b8e1926f96256966c",
  "container_config": {
    // ...
  },
  "created": "2016-11-29T16:03:39.813618407Z",
  // ...
}

NB the created datetime (though we should check it matches what we get from the schema1 manifests).

@errordeveloper
Contributor

Also, I happened to look at the code that gets the auth token, and learned there is a client library we can use, which might have a proper means of refreshing the token based on its TTL.

This is how the client library can be used:

import (
	"errors"

	"golang.org/x/oauth2"
	"golang.org/x/oauth2/google"
)

// tokenFromEnv obtains a GCP access token via Application Default
// Credentials (GOOGLE_APPLICATION_CREDENTIALS, gcloud config, or the
// instance metadata server, in that order).
func tokenFromEnv() (string, error) {
	var GCRScopes = []string{"https://www.googleapis.com/auth/cloud-platform"} // or whatever scopes you want
	var OAuthHTTPContext = oauth2.NoContext
	// DefaultTokenSource caches the token and refreshes it when it expires.
	ts, err := google.DefaultTokenSource(OAuthHTTPContext, GCRScopes...)
	if err != nil {
		return "", err
	}
	token, err := ts.Token()
	if err != nil {
		return "", err
	}
	if !token.Valid() {
		return "", errors.New("access_token was invalid")
	}
	return token.AccessToken, nil
}

I've copied that code from another thread I had with some folks at Google. Just FYI, thought this might be of interest.

@awh
Contributor

awh commented Oct 11, 2017

@jr-qordoba I've backported @squaremo's #780 PR onto the 1.0.1 branch and pushed a preview image to quay.io/weaveworks/flux:1.0.2-pre; any chance you could give it a try before we cut a 1.0.2 final release? Thank you!

@writer-jr

flux:1.0.2-pre appears to work as expected in my dev, test, and prod clusters. Thanks!

@squaremo
Member

I'm going to close this one, with a coda:

The underlying problem was that all requests to GCR were trivially failing (fixed in large part by a back-ported #780, and latterly #801).

We have rate limiting on the requests we make: 200 per second (requests are limited to 200 over a 1s window), with a burst of 125 (i.e., up to 125 requests can be in flight at once). If all requests fail quickly, you can see 125 or so errors in the log in a very short period, which is certainly alarming. It's not necessarily the case that GCR will throttle or otherwise reject requests out of hand, though you may want to tune the rps and burst down for other reasons (see https://cloud.google.com/container-registry/pricing, for example).
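For the curious, the rps/burst semantics above can be sketched as a token bucket (this is an illustration of the semantics, not flux's actual limiter; time is passed in explicitly to keep it deterministic):

```go
package main

// bucket is a toy token bucket: capacity corresponds to the burst,
// refillPerSec to the rps.
type bucket struct {
	tokens       float64
	capacity     float64
	refillPerSec float64
	lastSec      float64
}

func newBucket(rps, burst float64) *bucket {
	// The bucket starts full, so a full burst is available immediately.
	return &bucket{tokens: burst, capacity: burst, refillPerSec: rps}
}

// allow reports whether a request at time nowSec (seconds since some
// epoch) may proceed, consuming one token if so.
func (b *bucket) allow(nowSec float64) bool {
	// Refill at the configured rate, capped at the burst capacity.
	b.tokens += (nowSec - b.lastSec) * b.refillPerSec
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.lastSec = nowSec
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}
```

With rps=200 and burst=125, 125 requests at the same instant all pass and the 126th is refused, so if every request fails instantly you get roughly that many errors in the log at once, which matches the flurry seen above.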
