This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Cache warming exceeds GCR rate limiting #778

Closed
jpellizzari opened this issue Oct 5, 2017 · 7 comments

Comments

@jpellizzari

Reported by a customer.

Logs for the weave-flux-agent show that the agent is making way too many requests in a short period of time:
[screenshot, 2017-10-05: weave-flux-agent logs showing a rapid burst of registry requests]

Caller shows as warming.go: https://github.com/weaveworks/flux/blob/master/registry/warming.go

This seems to have started happening after the upgrade to 1.0.1.

@jpellizzari
Author

After bouncing the agent, it spammed a little more, then terminated:
[screenshot, 2017-10-05: agent log after restart, showing a few more errors before terminating]

@writer-jr

This appears to be the result of Google deprecating V1:
textPayload: "ts=2017-10-05T21:14:19Z caller=warming.go:145 component=warmer err="requesting manifests: getting remote manifest: Get https://gcr.io/v2/qordoba-devel/segmentation/manifests/latest: http: non-successful response (status=404 body=\"{\\\"errors\\\":[{\\\"code\\\":\\\"MANIFEST_UNKNOWN\\\",\\\"message\\\":\\\"Manifest with tag 'latest' has media type 'application/vnd.docker.distribution.manifest.v2+json', but client accepts 'application/vnd.docker.distribution.manifest.v1+json'.\\\"}]}\")" "

@squaremo
Member

squaremo commented Oct 5, 2017

Mitigation: tune the daemon arguments --registry-rps and --registry-burst way down (10 and 1). These govern the rate limiting for outgoing image registry requests so it doesn't spam even if it's getting errors.
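For reference, a sketch of how those flags might be passed to fluxd (the invocation below is illustrative; adapt it to however your deployment supplies container args):

```shell
# Illustrative fluxd invocation with registry rate limiting tuned down.
# --registry-rps:   max outgoing registry requests per second
# --registry-burst: max requests allowed in flight at once
fluxd \
  --registry-rps=10 \
  --registry-burst=1
```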

The errors in question are

textPayload: "ts=2017-10-05T21:14:19Z caller=warming.go:145 component=warmer err="requesting manifests: getting remote manifest: Get https://gcr.io/v2/path/to/repo/manifests/latest: http: non-successful response (status=404 body=\"{\\\"errors\\\":[{\\\"code\\\":\\\"MANIFEST_UNKNOWN\\\",\\\"message\\\":\\\"Manifest with tag 'latest' has media type 'application/vnd.docker.distribution.manifest.v2+json', but client accepts 'application/vnd.docker.distribution.manifest.v1+json'.\\\"}]}\")"

That last bit is the clue: GCR appears to no longer be serving v1 schema manifests. We assume v1 manifests are available for all images (which seems to be true elsewhere). On the other hand, Docker Hub doesn't seem to serve schema2 manifests for all images, only relatively recent ones. Sigh. We might have to look at the Content-Type to see which we've been given.
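A minimal sketch of that Content-Type check (the media type strings are the standard Docker distribution ones; the helper name is mine, not anything in flux):

```go
package main

import "fmt"

const (
	schema1MediaType       = "application/vnd.docker.distribution.manifest.v1+json"
	schema1SignedMediaType = "application/vnd.docker.distribution.manifest.v1+prettyjws"
	schema2MediaType       = "application/vnd.docker.distribution.manifest.v2+json"
)

// manifestSchemaVersion inspects the Content-Type returned by the
// registry and reports which manifest schema we were given.
func manifestSchemaVersion(contentType string) (int, error) {
	switch contentType {
	case schema1MediaType, schema1SignedMediaType:
		return 1, nil
	case schema2MediaType:
		return 2, nil
	default:
		return 0, fmt.Errorf("unrecognised manifest media type %q", contentType)
	}
}
```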

https://gist.github.com/squaremo/b3b832d6a3b9e0cc07fb35b854a66d47 has a program I used to see what the schema2 manifests look like, using the same docker registry client we use (https://github.com/heroku/docker-registry-client).

If we can't use schema1 manifests, we can use schema2. It takes an extra step, fetching the Config blob as shown in the gist, but that blob has the information we need. It looks like this:

{
  "architecture": "amd64",
  "author": "Weaveworks Inc <help@weave.works>",
  "config": {
    // ...
  },
  "container": "1aa5661ab22a2b34c0e785d94f24c61e129266dfbfa80b2b8e1926f96256966c",
  "container_config": {
    // ...
  },
  "created": "2016-11-29T16:03:39.813618407Z",
  // ...
}

NB the created datetime (though we should check it matches what we get from the schema1 manifests).

@errordeveloper
Contributor

Also, I happened to look at the code that gets the auth token, and learned there is a client library we can use, which might have a proper means of refreshing the token based on its TTL.

This is how the client library can be used:

import (
	"errors"

	"golang.org/x/oauth2"
	"golang.org/x/oauth2/google"
)

// tokenFromEnv obtains a GCP access token via Application Default
// Credentials (GOOGLE_APPLICATION_CREDENTIALS, gcloud config, or the
// instance metadata server, in that order).
func tokenFromEnv() (string, error) {
	var GCRScopes = []string{"https://www.googleapis.com/auth/cloud-platform"} // or whatever scopes you want
	var OAuthHTTPContext = oauth2.NoContext
	// DefaultTokenSource caches the token and refreshes it when it expires.
	ts, err := google.DefaultTokenSource(OAuthHTTPContext, GCRScopes...)
	if err != nil {
		return "", err
	}
	token, err := ts.Token()
	if err != nil {
		return "", err
	}
	if !token.Valid() {
		return "", errors.New("access_token was invalid")
	}
	return token.AccessToken, nil
}

I've copied that code from another thread I had with some folks at Google. Just FYI, thought this might be of interest.

@awh
Contributor

awh commented Oct 11, 2017

@jr-qordoba I've backported @squaremo's #780 PR onto the 1.0.1 branch and pushed a preview image to quay.io/weaveworks/flux:1.0.2-pre; any chance you could give it a try before we cut a 1.0.2 final release? Thank you!

@writer-jr

flux:1.0.2-pre appears to work as expected in my dev, test, and prod clusters. Thanks!

@squaremo
Member

I'm going to close this one, with a coda:

The underlying problem was that all requests to GCR were trivially failing (fixed in large part by a back-ported #780, and latterly #801).

We have rate limiting on the requests we make: 200 per second (requests are limited to 200 over a 1s window), with a burst of 125 (i.e., up to 125 requests can be in flight at once). If all requests fail quickly, you can see 125 or so errors in the log in a very short period, which is certainly alarming. It's not necessarily the case that GCR will throttle or otherwise reject requests out of hand, though you may want to tune the rps and burst down for other reasons (see https://cloud.google.com/container-registry/pricing, for example).
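For the curious, the rps/burst semantics above can be sketched as a token bucket (this is an illustration of the semantics, not flux's actual limiter; time is passed in explicitly to keep it deterministic):

```go
package main

// bucket is a toy token bucket: capacity corresponds to the burst,
// refillPerSec to the rps.
type bucket struct {
	tokens       float64
	capacity     float64
	refillPerSec float64
	lastSec      float64
}

func newBucket(rps, burst float64) *bucket {
	// The bucket starts full, so a full burst is available immediately.
	return &bucket{tokens: burst, capacity: burst, refillPerSec: rps}
}

// allow reports whether a request at time nowSec (seconds since some
// epoch) may proceed, consuming one token if so.
func (b *bucket) allow(nowSec float64) bool {
	// Refill at the configured rate, capped at the burst capacity.
	b.tokens += (nowSec - b.lastSec) * b.refillPerSec
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.lastSec = nowSec
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}
```

With rps=200 and burst=125, 125 requests at the same instant all pass and the 126th is refused, so if every request fails instantly you get roughly that many errors in the log at once, which matches the flurry seen above.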
