Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more prometheus metrics #141

Merged
merged 10 commits into from
Feb 18, 2024
Merged

Conversation

msbarry
Copy link
Contributor

@msbarry msbarry commented Feb 14, 2024

Add more prometheus metrics to pmtiles server:

  • overall HTTP request duration, response size, and count by status/archive/handler
  • underlying bucket request duration and count by status/archive
  • directory cache size/limit, requests by hit/miss/inflight status

Also fixed a few things I found testing out the corner cases:

  • translate any client cancellation exceptions to a 499 http response code, otherwise they are 500 and look like server failures in the stats
  • return a 204 for OPTIONS requests since if there's a body the client will cancel reading it
  • return a 405 for any other request methods besides GET/HEAD/OPTIONS - it doesn't seem like those should be supported?

The impact on performance seems to be negligible, tilesiege with 1.25m requests was around 37k qps before/after this change.

Example /metrics response after a bunch of requests to "planet" archive that exists and "plaet" archive that does not
# HELP pmtiles_bucket_request_duration_seconds Request duration in seconds for individual requests to the underlying bucket
# TYPE pmtiles_bucket_request_duration_seconds histogram
pmtiles_bucket_request_duration_seconds_bucket{archive="",status="404",le="0.005"} 40
pmtiles_bucket_request_duration_seconds_bucket{archive="",status="404",le="0.01"} 51
pmtiles_bucket_request_duration_seconds_bucket{archive="",status="404",le="0.025"} 70
pmtiles_bucket_request_duration_seconds_bucket{archive="",status="404",le="0.05"} 116
pmtiles_bucket_request_duration_seconds_bucket{archive="",status="404",le="0.1"} 126
pmtiles_bucket_request_duration_seconds_bucket{archive="",status="404",le="0.25"} 130
pmtiles_bucket_request_duration_seconds_bucket{archive="",status="404",le="0.5"} 130
pmtiles_bucket_request_duration_seconds_bucket{archive="",status="404",le="1"} 130
pmtiles_bucket_request_duration_seconds_bucket{archive="",status="404",le="2.5"} 130
pmtiles_bucket_request_duration_seconds_bucket{archive="",status="404",le="5"} 130
pmtiles_bucket_request_duration_seconds_bucket{archive="",status="404",le="10"} 130
pmtiles_bucket_request_duration_seconds_bucket{archive="",status="404",le="+Inf"} 130
pmtiles_bucket_request_duration_seconds_sum{archive="",status="404"} 3.3402572059999986
pmtiles_bucket_request_duration_seconds_count{archive="",status="404"} 130
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="206",le="0.005"} 0
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="206",le="0.01"} 0
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="206",le="0.025"} 2
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="206",le="0.05"} 32
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="206",le="0.1"} 94
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="206",le="0.25"} 118
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="206",le="0.5"} 118
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="206",le="1"} 118
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="206",le="2.5"} 118
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="206",le="5"} 118
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="206",le="10"} 118
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="206",le="+Inf"} 118
pmtiles_bucket_request_duration_seconds_sum{archive="planet",status="206"} 8.975607534
pmtiles_bucket_request_duration_seconds_count{archive="planet",status="206"} 118
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="canceled",le="0.005"} 0
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="canceled",le="0.01"} 0
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="canceled",le="0.025"} 0
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="canceled",le="0.05"} 5
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="canceled",le="0.1"} 6
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="canceled",le="0.25"} 8
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="canceled",le="0.5"} 8
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="canceled",le="1"} 8
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="canceled",le="2.5"} 8
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="canceled",le="5"} 8
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="canceled",le="10"} 8
pmtiles_bucket_request_duration_seconds_bucket{archive="planet",status="canceled",le="+Inf"} 8
pmtiles_bucket_request_duration_seconds_sum{archive="planet",status="canceled"} 0.512169418
pmtiles_bucket_request_duration_seconds_count{archive="planet",status="canceled"} 8
# HELP pmtiles_bucket_requests_total Requests to the underlying bucket
# TYPE pmtiles_bucket_requests_total counter
pmtiles_bucket_requests_total{archive="",kind="root",status="404"} 10
pmtiles_bucket_requests_total{archive="",kind="tile",status="404"} 120
pmtiles_bucket_requests_total{archive="planet",kind="leaf",status="206"} 10
pmtiles_bucket_requests_total{archive="planet",kind="root",status="206"} 1
pmtiles_bucket_requests_total{archive="planet",kind="tile",status="206"} 107
pmtiles_bucket_requests_total{archive="planet",kind="tile",status="canceled"} 8
# HELP pmtiles_buildinfo 
# TYPE pmtiles_buildinfo gauge
pmtiles_buildinfo{revision="none",version="dev"} 1
# HELP pmtiles_buildtime 
# TYPE pmtiles_buildtime gauge
pmtiles_buildtime 0
# HELP pmtiles_dir_cache_entries Number of directories in the cache
# TYPE pmtiles_dir_cache_entries gauge
pmtiles_dir_cache_entries 12
# HELP pmtiles_dir_cache_limit_bytes Maximum directory cache size limit in bytes
# TYPE pmtiles_dir_cache_limit_bytes gauge
pmtiles_dir_cache_limit_bytes 6.4e+07
# HELP pmtiles_dir_cache_requests Requests to the directory cache by archive and status (hit/miss)
# TYPE pmtiles_dir_cache_requests counter
pmtiles_dir_cache_requests{archive="planet",status="hit"} 694
pmtiles_dir_cache_requests{archive="planet",status="miss"} 11
pmtiles_dir_cache_requests{archive="plet",status="hit"} 37
pmtiles_dir_cache_requests{archive="plet",status="miss"} 10
# HELP pmtiles_dir_cache_size_bytes Current directory cache usage in bytes
# TYPE pmtiles_dir_cache_size_bytes gauge
pmtiles_dir_cache_size_bytes 3.791191e+06
# HELP pmtiles_request_duration_seconds Overall request duration in seconds
# TYPE pmtiles_request_duration_seconds histogram
pmtiles_request_duration_seconds_bucket{archive="",handler="tile",status="404",le="0.005"} 7
pmtiles_request_duration_seconds_bucket{archive="",handler="tile",status="404",le="0.01"} 8
pmtiles_request_duration_seconds_bucket{archive="",handler="tile",status="404",le="0.025"} 22
pmtiles_request_duration_seconds_bucket{archive="",handler="tile",status="404",le="0.05"} 36
pmtiles_request_duration_seconds_bucket{archive="",handler="tile",status="404",le="0.1"} 39
pmtiles_request_duration_seconds_bucket{archive="",handler="tile",status="404",le="0.25"} 47
pmtiles_request_duration_seconds_bucket{archive="",handler="tile",status="404",le="0.5"} 47
pmtiles_request_duration_seconds_bucket{archive="",handler="tile",status="404",le="1"} 47
pmtiles_request_duration_seconds_bucket{archive="",handler="tile",status="404",le="2.5"} 47
pmtiles_request_duration_seconds_bucket{archive="",handler="tile",status="404",le="5"} 47
pmtiles_request_duration_seconds_bucket{archive="",handler="tile",status="404",le="10"} 47
pmtiles_request_duration_seconds_bucket{archive="",handler="tile",status="404",le="+Inf"} 47
pmtiles_request_duration_seconds_sum{archive="",handler="tile",status="404"} 2.1503623300000005
pmtiles_request_duration_seconds_count{archive="",handler="tile",status="404"} 47
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="200",le="0.005"} 0
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="200",le="0.01"} 0
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="200",le="0.025"} 1
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="200",le="0.05"} 12
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="200",le="0.1"} 41
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="200",le="0.25"} 63
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="200",le="0.5"} 64
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="200",le="1"} 64
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="200",le="2.5"} 64
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="200",le="5"} 64
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="200",le="10"} 64
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="200",le="+Inf"} 64
pmtiles_request_duration_seconds_sum{archive="planet",handler="tile",status="200"} 6.047602334
pmtiles_request_duration_seconds_count{archive="planet",handler="tile",status="200"} 64
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="304",le="0.005"} 0
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="304",le="0.01"} 0
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="304",le="0.025"} 0
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="304",le="0.05"} 3
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="304",le="0.1"} 29
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="304",le="0.25"} 41
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="304",le="0.5"} 43
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="304",le="1"} 43
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="304",le="2.5"} 43
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="304",le="5"} 43
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="304",le="10"} 43
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="304",le="+Inf"} 43
pmtiles_request_duration_seconds_sum{archive="planet",handler="tile",status="304"} 4.370763833
pmtiles_request_duration_seconds_count{archive="planet",handler="tile",status="304"} 43
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="canceled",le="0.005"} 5
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="canceled",le="0.01"} 17
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="canceled",le="0.025"} 39
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="canceled",le="0.05"} 106
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="canceled",le="0.1"} 124
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="canceled",le="0.25"} 128
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="canceled",le="0.5"} 128
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="canceled",le="1"} 128
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="canceled",le="2.5"} 128
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="canceled",le="5"} 128
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="canceled",le="10"} 128
pmtiles_request_duration_seconds_bucket{archive="planet",handler="tile",status="canceled",le="+Inf"} 128
pmtiles_request_duration_seconds_sum{archive="planet",handler="tile",status="canceled"} 4.8165444530000014
pmtiles_request_duration_seconds_count{archive="planet",handler="tile",status="canceled"} 128
# HELP pmtiles_requests_total Overall number of requests to the service
# TYPE pmtiles_requests_total counter
pmtiles_requests_total{archive="",handler="CONNECT",status="405"} 1
pmtiles_requests_total{archive="",handler="OPTIONS",status="204"} 253
pmtiles_requests_total{archive="",handler="POST",status="405"} 1
pmtiles_requests_total{archive="",handler="tile",status="404"} 47
pmtiles_requests_total{archive="planet",handler="tile",status="200"} 64
pmtiles_requests_total{archive="planet",handler="tile",status="304"} 43
pmtiles_requests_total{archive="planet",handler="tile",status="canceled"} 128
# HELP pmtiles_response_size_bytes Overall response size in bytes
# TYPE pmtiles_response_size_bytes histogram
pmtiles_response_size_bytes_bucket{archive="",handler="tile",status="404",le="1024"} 47
pmtiles_response_size_bytes_bucket{archive="",handler="tile",status="404",le="5120"} 47
pmtiles_response_size_bytes_bucket{archive="",handler="tile",status="404",le="10240"} 47
pmtiles_response_size_bytes_bucket{archive="",handler="tile",status="404",le="25600"} 47
pmtiles_response_size_bytes_bucket{archive="",handler="tile",status="404",le="51200"} 47
pmtiles_response_size_bytes_bucket{archive="",handler="tile",status="404",le="102400"} 47
pmtiles_response_size_bytes_bucket{archive="",handler="tile",status="404",le="256000"} 47
pmtiles_response_size_bytes_bucket{archive="",handler="tile",status="404",le="512000"} 47
pmtiles_response_size_bytes_bucket{archive="",handler="tile",status="404",le="1.048576e+06"} 47
pmtiles_response_size_bytes_bucket{archive="",handler="tile",status="404",le="+Inf"} 47
pmtiles_response_size_bytes_sum{archive="",handler="tile",status="404"} 799
pmtiles_response_size_bytes_count{archive="",handler="tile",status="404"} 47
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="200",le="1024"} 2
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="200",le="5120"} 6
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="200",le="10240"} 11
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="200",le="25600"} 18
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="200",le="51200"} 33
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="200",le="102400"} 55
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="200",le="256000"} 64
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="200",le="512000"} 64
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="200",le="1.048576e+06"} 64
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="200",le="+Inf"} 64
pmtiles_response_size_bytes_sum{archive="planet",handler="tile",status="200"} 3.403524e+06
pmtiles_response_size_bytes_count{archive="planet",handler="tile",status="200"} 64
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="304",le="1024"} 0
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="304",le="5120"} 0
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="304",le="10240"} 1
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="304",le="25600"} 6
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="304",le="51200"} 18
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="304",le="102400"} 34
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="304",le="256000"} 43
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="304",le="512000"} 43
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="304",le="1.048576e+06"} 43
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="304",le="+Inf"} 43
pmtiles_response_size_bytes_sum{archive="planet",handler="tile",status="304"} 2.787782e+06
pmtiles_response_size_bytes_count{archive="planet",handler="tile",status="304"} 43
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="canceled",le="1024"} 127
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="canceled",le="5120"} 127
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="canceled",le="10240"} 127
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="canceled",le="25600"} 127
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="canceled",le="51200"} 127
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="canceled",le="102400"} 128
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="canceled",le="256000"} 128
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="canceled",le="512000"} 128
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="canceled",le="1.048576e+06"} 128
pmtiles_response_size_bytes_bucket{archive="planet",handler="tile",status="canceled",le="+Inf"} 128
pmtiles_response_size_bytes_sum{archive="planet",handler="tile",status="canceled"} 102841
pmtiles_response_size_bytes_count{archive="planet",handler="tile",status="canceled"} 128

pmtiles/server.go Outdated Show resolved Hide resolved
pmtiles/server.go Show resolved Hide resolved
if ok, key := parseMetadataPath(path); ok {
return server.getMetadata(ctx, httpHeaders, key)
archive, handler = key, "tile"
status, headers, data = server.getTile(ctx, headers, key, z, x, y, ext)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern here is that someone could make a bunch of random requests to nonexistent archives and all of the 404's for unique resources would blow up the metric response size. To avoid that, we could either:

  1. try to map all "archive does not exist" responses to handler="404" archive="" (but we don't want to miss logging legit failures about a bad connection to an archive you care about)
  2. or we could just recommend that applications put this behind a proxy that allows only requests to archives you know exist

Thoughts @bdon ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer 1). for 2) we would have to recommend they put the whole go-pmtiles application behind a allowlist-proxy, right? that is too strong of a limitation imho.

for 1) I think that discovering failures about missing archives seems outside of the scope of performance metrics, I don't have a lot of experience in how people use prometheus though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case I want to catch is if people are reading from an archive, and it suddenly goes missing or server starts returning 500 errors we want to able to see which archive is no longer working. We could limit the per-archive metrics to only archives that we've gotten an OK response for the root directory, but that would miss if the server restarts and fails to read at startup... it would also mean 2 metrics you need to alert on to know if ties are down

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to limit the universe of stored tileset names to ~100?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://prometheus.io/docs/practices/naming/#labels seems to recommend against unbounded cardinality... I feel like detecting a spike in 404s/500s and inspecting logs is good enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, changed bucket and overall request metrics to set archive to "" on not found errors.


func createMetrics(scope string, logger *log.Logger) *metrics {
namespace := "pmtiles"
durationBuckets := prometheus.DefBuckets
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default buckets are 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s

pmtiles/server_metrics.go Outdated Show resolved Hide resolved
pmtiles/server.go Outdated Show resolved Hide resolved
@msbarry msbarry marked this pull request as draft February 16, 2024 12:00
@msbarry msbarry marked this pull request as ready for review February 16, 2024 23:27
@bdon bdon merged commit f242507 into protomaps:main Feb 18, 2024
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants