This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

unable to fetch node CPU metrics #2605

Closed
texasbobs opened this issue Nov 12, 2019 · 9 comments · Fixed by #2606
Comments

@texasbobs

We are using Flux 1.13.2 and, in some clusters, it no longer clones the repos to be synced. The error indicates that Flux is unable to execute a query against the Prometheus API.

ts=2019-11-12T17:00:47.74181943Z caller=images.go:17 component=sync-loop msg="polling images"
ts=2019-11-12T17:00:54.041654513Z caller=images.go:27 component=sync-loop msg="no automated workloads"
ts=2019-11-12T17:04:33.083254717Z caller=loop.go:111 component=sync-loop event=refreshed url=git@github.myrepo.git branch=master HEAD=3e9
ts=2019-11-12T17:06:05.738649349Z caller=loop.go:85 component=sync-loop err="collating resources in cluster for sync: Error while fetching node metrics for selector : unable to fetch node CPU metrics: unable to execute query: Get http://prometheus-k8s.monitoring.svc:9090/api/v1/query?query=sum%281+-+rate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B1m%5D%29+%2A+on%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7Bnode%3D~%22url%7C url%7C url%22%7D%29+by+%28node%29&time=1573578364.541: dial tcp 10.233.9.37:9090: connect: connection refused"
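As an aside (not part of the original report), the URL-encoded query in the log can be decoded with Python's standard library to see the PromQL the metrics backend was asked to run. The node matcher values ("url| url| url") appear redacted in the log and are kept as-is:

```python
from urllib.parse import unquote_plus

# The query parameter exactly as it appears in the Flux log above.
encoded = (
    "sum%281+-+rate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B1m%5D%29"
    "+%2A+on%28namespace%2C+pod%29+group_left%28node%29"
    "+node_namespace_pod%3Akube_pod_info%3A%7Bnode%3D~%22url%7C url%7C url%22%7D%29"
    "+by+%28node%29"
)

# unquote_plus also turns '+' into spaces, matching query-string encoding.
print(unquote_plus(encoded))
```

This prints a per-node CPU utilisation query of the kind prometheus-adapter runs on behalf of the resource metrics API.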

How can we prevent this from stopping the sync?

Why does Flux care about CPU metrics at all?

It looked like there was a PR to ignore these types of errors back in 1.12.2. #2009

@texasbobs texasbobs added the blocked-needs-validation (issue is waiting to be validated before we can proceed) and bug labels Nov 12, 2019
@squaremo
Member

Why does Flux care about CPU metrics at all?

It doesn't. Possibly this is an error reported by the Kubernetes API client, either from something it is attempting, or something attempted by an intermediary, or the API server itself.

@stefanprodan
Member

@squaremo it looks like our use of the discovery API triggers the metrics queries. This is very troubling, since a busy cluster with many metrics will add a huge delay to the Flux sync.

@squaremo
Member

it looks like our use of the discovery API triggers the metrics queries

How .. even ... I don't ...

@stefanprodan
Member

I think we should be using ServerGroups() and then call ServerResourcesForGroupVersion for each group, excluding the metrics ones, to avoid querying the metrics providers.
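The idea above can be sketched as follows. This is a minimal, self-contained illustration (not the actual Flux patch): the `isMetricsGroup` helper name is hypothetical, and the real code would call the client-go discovery methods `ServerGroups()` and `ServerResourcesForGroupVersion(gv)` where the comments indicate:

```go
package main

import (
	"fmt"
	"strings"
)

// isMetricsGroup reports whether an API group is served by a metrics
// aggregated API server (metrics-server, prometheus-adapter); discovery
// calls against those groups fan out to the metrics backend.
// Hypothetical helper name, covering metrics.k8s.io plus the
// custom.metrics.k8s.io / external.metrics.k8s.io variants.
func isMetricsGroup(group string) bool {
	return group == "metrics.k8s.io" ||
		strings.HasSuffix(group, ".metrics.k8s.io")
}

func main() {
	// In Flux these names would come from discovery.ServerGroups().
	groups := []string{"apps", "batch", "metrics.k8s.io", "custom.metrics.k8s.io"}
	for _, g := range groups {
		if isMetricsGroup(g) {
			continue // skip: would trigger queries against the metrics provider
		}
		// Here the real code would call ServerResourcesForGroupVersion
		// for each version of the group.
		fmt.Println("discover resources for group:", g)
	}
}
```

With this filter, an unreachable Prometheus no longer blocks resource discovery, because the metrics groups are never queried during sync.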

@stefanprodan
Member

@texasbobs can you please test stefanprodan/flux:fix-discovery.1 in your cluster and let me know if the sync works? Thanks

@texasbobs
Author

What version is that based on, @stefanprodan ? We have not tested beyond 1.13.2 and this cluster is in a prod environment.

@stefanprodan
Member

It's based on master. If you have a dev cluster, can you please scale Prometheus to zero and test it out?

@texasbobs
Author

I confirmed that it fails in my test cluster on both 1.13.2 and 1.15.0. The image above works correctly.

@stefanprodan
Member

@texasbobs thanks a lot for testing it. I'm also running my own tests with prometheus-adapter and metrics-server.

@stefanprodan stefanprodan removed the blocked-needs-validation (issue is waiting to be validated before we can proceed) label Nov 12, 2019
3 participants