vSphere Input does not collect datastore metrics #4789

Closed
photinus opened this issue Oct 2, 2018 · 52 comments
Labels: area/vsphere, bug (unexpected problem or unintended behavior)

Comments

@photinus commented Oct 2, 2018

Relevant telegraf.conf:

[agent]
interval = "10s"
round_interval = true
metric_buffer_limit = 1000
flush_buffer_when_full = true
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
debug = true
quiet = false
logfile = "/Program Files/Telegraf/telegraf.log"

hostname = ""
[[outputs.influxdb]]
urls = ["udp://10.120.1.44:8089"]

[[inputs.vsphere]]
vcenters = [ "https://vsphere.address.here/sdk" ]
username = "vsphereUSer"
password = "SuperSecretVspherePasswordOfGreatness"

datastore_metric_include = [ "*" ]

vm_metric_exclude = [ "*" ]

host_metric_include = [
"cpu.coreUtilization.average",
"cpu.costop.summation",
"cpu.demand.average",
"cpu.idle.summation",
"cpu.latency.average",
"cpu.readiness.average",
"cpu.ready.summation",
"cpu.swapwait.summation",
"cpu.usage.average",
"cpu.usagemhz.average",
"cpu.used.summation",
"cpu.utilization.average",
"cpu.wait.summation",
"mem.active.average",
"mem.latency.average",
"mem.state.latest",
"mem.swapin.average",
"mem.swapinRate.average",
"mem.swapout.average",
"mem.swapoutRate.average",
"mem.totalCapacity.average",
"mem.usage.average",
"mem.vmmemctl.average",
"net.bytesRx.average",
"net.bytesTx.average",
"net.droppedRx.summation",
"net.droppedTx.summation",
"net.errorsRx.summation",
"net.errorsTx.summation",
"net.usage.average",
"power.power.average",
"sys.uptime.latest",
]
host_metric_exclude = [] ## Nothing excluded by default
host_instances = true ## true by default

cluster_metric_exclude = [""] ## Nothing excluded by default
cluster_instances = true ## true by default
datacenter_metric_exclude = [ "
" ] ## Datacenters are not collected by default.

collect_concurrency = 4
discover_concurrency = 2

timeout = "20s"

insecure_skip_verify = true

System info:

Telegraf 1.8.0
Windows Server 2012 r2
vSphere Appliance 6.5 u1d

Steps to reproduce:

  1. Configured Telegraf to collect just host and datastore metrics
  2. Telegraf writes metrics for hosts, but no metrics for datastores

Expected behavior:

Datastore metrics are written to influxdb

Actual behavior:

No datastore metrics are written to influxdb

Additional info:

Logs:
2018-10-02T20:40:04Z D! Attempting connection to output: influxdb
2018-10-02T20:40:04Z D! Successfully connected to output: influxdb
2018-10-02T20:40:04Z I! Starting Telegraf 1.8.0
2018-10-02T20:40:04Z I! Loaded inputs: inputs.vsphere
2018-10-02T20:40:04Z I! Loaded aggregators:
2018-10-02T20:40:04Z I! Loaded processors:
2018-10-02T20:40:04Z I! Loaded outputs: influxdb
2018-10-02T20:40:04Z I! Tags enabled: host=telegraf
2018-10-02T20:40:04Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"telegraf", Flush Interval:10s
2018-10-02T20:40:10Z D! [input.vsphere]: Starting plugin
2018-10-02T20:40:10Z D! [input.vsphere]: Creating client: vsphere.address.here
2018-10-02T20:40:10Z D! [input.vsphere]: Start of sample period deemed to be 2018-10-02 13:35:10.1675805 -0700 PDT m=-292.934584699
2018-10-02T20:40:10Z D! [input.vsphere]: Collecting metrics for 0 objects of type datastore for vsphere.address.here
2018-10-02T20:40:10Z D! [input.vsphere]: Discover new objects for vsphere.address.here
2018-10-02T20:40:10Z D! [input.vsphere] Discovering resources for datacenter
2018-10-02T20:40:10Z D! [input.vsphere]: No parent found for Folder:group-d1 (ascending from Folder:group-d1)
2018-10-02T20:40:10Z D! [input.vsphere] Discovering resources for cluster
2018-10-02T20:40:10Z D! [input.vsphere] Discovering resources for host
2018-10-02T20:40:11Z D! [input.vsphere] Discovering resources for vm
2018-10-02T20:40:11Z D! [input.vsphere] Discovering resources for datastore
2018-10-02T20:40:20Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-10-02T20:40:20Z D! [input.vsphere]: Latest: 2018-10-02 13:40:10.1675805 -0700 PDT m=+7.065415301, elapsed: 14.846599, resource: datastore
2018-10-02T20:40:20Z D! [input.vsphere]: Sampling period for datastore of 300 has not elapsed for vsphere.address.here

@rsasportes commented Oct 3, 2018

I have the exact same issue here.
My log shows mostly the same result as above.

Environment: vCenter 6.7 and vSphere 6.7
Storage: datastores located on NFS

@prydin (Contributor) commented Oct 3, 2018

Did you let it run for a while? It misses the first collection because background object discovery hasn't finished yet. You can force the collector to wait for the first round of discovery by setting this flag in the config:

force_discover_on_init = true

Let me know if this solved your problem.
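For reference, a minimal sketch of where this flag sits in the plugin block (the vCenter URL and credentials below are placeholders, not taken from this thread):

[[inputs.vsphere]]
  vcenters = [ "https://vcenter.example.com/sdk" ]
  username = "user"
  password = "password"
  ## Make the first collection wait until the initial object discovery has finished
  force_discover_on_init = true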

@danielnelson added the bug, need more info, and area/vsphere labels on Oct 3, 2018
@rsasportes commented Oct 4, 2018

Hi,

I've now applied your trick, and my telegraf.log is now a little more explicit about what's happening. Here it is:

2018-10-04T06:01:47Z D! [input.vsphere]: Start of sample period deemed to be 2018-10-04 05:56:47.096487102 +0000 UTC m=-279.309774333
2018-10-04T06:01:47Z D! [input.vsphere]: Collecting metrics for 40 objects of type datastore for
2018-10-04T06:01:47Z D! [input.vsphere]: Querying 37 objects, 256 metrics (3 remaining) of type datastore for 02-sys-v278.chjltn.local. Processed objects: 37. Total objects 40
2018-10-04T06:01:47Z D! Output [influxdb] wrote batch of 1000 metrics in 42.794393ms
2018-10-04T06:01:50Z D! Output [influxdb] buffer fullness: 159 / 10000 metrics.
2018-10-04T06:01:50Z D! Output [influxdb] wrote batch of 159 metrics in 21.619734ms
2018-10-04T06:01:55Z E! Error in plugin [inputs.vsphere]: took longer to collect than collection interval (10s)
2018-10-04T06:02:00Z D! Output [influxdb] buffer fullness: 19 / 10000 metrics.
2018-10-04T06:02:00Z D! Output [influxdb] wrote batch of 19 metrics in 7.921797ms

@rsasportes commented:

I have extended the interval between collections, and now I have another set of errors:



2018-10-04T06:48:16Z D! [input.vsphere]: Start of sample period deemed to be 2018-10-04 06:43:16.837579961 +0000 UTC m=-178.066128366
2018-10-04T06:48:16Z D! [input.vsphere]: Collecting metrics for 40 objects of type datastore for VCENTER
2018-10-04T06:48:16Z D! [input.vsphere]: Querying 37 objects, 256 metrics (3 remaining) of type datastore for VCENTER. Processed objects: 37. Total objects 40
2018-10-04T06:48:20Z D! Output [influxdb] buffer fullness: 565 / 10000 metrics.
2018-10-04T06:48:20Z D! Output [influxdb] wrote batch of 565 metrics in 34.680264ms
2018-10-04T06:48:30Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics.
2018-10-04T06:48:36Z D! [input.vsphere]: Query returned 0 metrics

@prydin (Contributor) commented Oct 4, 2018

@rsasportes I don't see any errors in that log. Just debug messages. Am I missing something?

@prydin (Contributor) commented Oct 4, 2018

Oh! Now I see the problem! The query isn't returning any data. Which version of Telegraf are you on? 1.8 has a bug in it that can cause queries to return 0 objects if the time on the node where Telegraf runs is ahead of vCenter. This is fixed in 1.8.1, which was just released.

@rsasportes commented:

Hi,

First, thanks for your help ;-)

I've upgraded Telegraf to 1.8.1_1, and now some datastores start to appear in the dashboard.
Unfortunately, it is only a small number of them (3 out of 40).

As the log shows:



Latest: 2018-10-05 07:45:17.898618 +0000 UTC, elapsed: 304.955344, resource: datastore
2018-10-05T07:50:17Z D! [input.vsphere]: Start of sample period deemed to be 2018-10-05 07:45:17.898618 +0000 UTC
2018-10-05T07:50:17Z D! [input.vsphere]: Collecting metrics for 40 objects of type datastore for VCENTER
2018-10-05T07:50:17Z D! [input.vsphere]: Querying 37 objects, 256 metrics (3 remaining) of type datastore for VCENTER. Processed objects: 37. Total objects 40
2018-10-05T07:50:17Z D! Output [influxdb] wrote batch of 1000 metrics in 42.750677ms
2018-10-05T07:50:20Z D! Output [influxdb] buffer fullness: 165 / 10000 metrics.
2018-10-05T07:50:20Z D! Output [influxdb] wrote batch of 165 metrics in 19.038192ms
2018-10-05T07:50:22Z D! [input.vsphere] Discovering resources for datastore
2018-10-05T07:50:30Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics.
2018-10-05T07:50:37Z D! [input.vsphere]: Query returned 0 metrics
2018-10-05T07:50:39Z D! [input.vsphere]: Query returned 0 metrics



I've extended the data collection interval to 5 minutes, just to check if there is a timeout. No luck.

Do you think it might be possible to manually launch data collection, alongside verbose logging, and trace what's wrong?

Again, thanks !

@prydin (Contributor) commented Oct 5, 2018

Do you see any metrics at all? Do you see a complete set of metrics for some datastores or do you only see sporadic metrics for random datastores? Also, for anything that's missing, can you go to the vCenter UI and make sure you can see those metrics under Monitoring->Performance?

@Muellerflo commented:

Hi,
I have the same issue and only see sporadic metrics in the Grafana dashboard.
I also want to exclude some metrics (all local ESXi disks, which are named "hypervisorxxx-local"):
datastore_metric_exclude = ["*-local"]
But the plugin still collects metrics for 98 datastores ;)

Thanks.

@prydin (Contributor) commented Oct 8, 2018

@Muellerflo the includes and excludes act on metric names, not object names. The ability to filter objects will be added soon. See #4790

As for the sporadic metrics, have you checked the log to see whether you get any timeouts or collections that take longer than the interval?
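To illustrate the metric-name filtering described above (a hedged sketch; the metric name is only an example), an exclude pattern has to match metric names such as disk.used.latest, so an object-name pattern like "*-local" matches no metric name and therefore excludes nothing:

[[inputs.vsphere]]
  ## Acts on metric names, not on datastore names
  datastore_metric_exclude = [ "disk.used.latest" ]
  ## An object-name pattern such as "*-local" would match no metric name here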

@ion-storm commented Oct 12, 2018

We have the same issue; it appears to happen with larger vCenters. One vCenter with only 5 datastores worked perfectly, but on the other, with dozens of datastores, the data collection failed. Any ETA on a fix? This appears to be affecting many others as well.

@prydin (Contributor) commented Oct 12, 2018

@ion-storm anything in the logs?

@prydin (Contributor) commented Oct 12, 2018

@ion-storm There are several reasons why datastore metrics could be missing. What is your collection interval? Have you tried to declare the plugin separately for the datastores with a longer collection interval?

@ybinnenwegin2ip (Contributor) commented:

Hello,

We also seem to be experiencing issues with receiving certain data from certain datastores.

The data that Telegraf stores into the measurement "vsphere_datastore_datastore" does not seem to appear for certain datastores. The data in the measurement "vsphere_datastore_disk" however, does.

So, regarding our setup, we have 25 datastores in total, of which 19 have a type of "VMFS"; the other 6 have a type of "NFS 3". The NFS ones are the ones that don't show up in the "vsphere_datastore_datastore" measurement (in InfluxDB).

When running Telegraf I turned on debug logging and it does discover 25 datastores:

2018-10-17T18:45:34Z D! [input.vsphere]: Collecting metrics for 25 objects of type datastore for VCENTER_INSTANCE

I messed around a bit more in both Telegraf and govmomi and while printing the data that is processed in govmomi's ToMetricSeries function (which seems to fetch numberReadAveraged & numberWriteAveraged, which are stored in "vsphere_datastore_datastore") I only saw the VMFS datastores coming by, none of the NFS datastores.

I haven't dug any deeper yet, but hopefully this will help someone along their way. :)

If anyone wants me to try things out, I'm in the CEST timezone.

@prydin (Contributor) commented Oct 17, 2018

@ybinnenwegin2ip What's the statistics level on your vCenter? I believe you have to be at least at level 3 for those metrics to be collected.

You could also try to check the metrics using the govc tool. Something like this:

govc metric.sample -n 10 /DC/datastore/myds datastore.numberWriteAveraged.average

If that doesn't return any metrics, you're simply not collecting them on your vCenter and you'd have to increase the statistics level for the 5 minute buckets.

@ybinnenwegin2ip (Contributor) commented Oct 17, 2018

@prydin

Thanks for your quick response!

I gave it a shot and this is all I get back:

$ ./govc_linux_amd64 metric.sample /DC_NAME/datastore/DATASTORE_NAME datastore.numberReadAveraged.average
DATASTORE_NAME  -  datastore.numberWriteAveraged.average      num

I'll look into the statistics level, thanks for the pointer!

EDIT:

I noticed your edit, I think you added the -n 10?

Either way, I ran it again, didn't change:

$ ./govc_linux_amd64 metric.sample -n 10 /DC_NAME/datastore/DATASTORE_NAME datastore.numberReadAveraged.average
DATASTORE_NAME  -  datastore.numberReadAveraged.average      num

I've also just increased the statistics level from "1" to "2", let's see what happens. :)

@prydin (Contributor) commented Oct 17, 2018

I think you need at least 3. Just verified in my lab. If I drop it lower than 3, the metric disappears.

@ybinnenwegin2ip (Contributor) commented:

> I think you need at least 3. Just verified in my lab. If I drop it lower than 3, the metric disappears.

Thanks! It's set to 2 now and while I do see some statistics appearing (read_average & write_average) the others are indeed still missing. Perhaps I misunderstood the VMware documentation :)

Level 2

Disk – All metrics, excluding numberRead and numberWrite. 

(https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.monitoring.doc/GUID-25800DE4-68E5-41CC-82D9-8811E27924BC.html)

I guess a networked datastore doesn't count as a disk but rather a 'device' then?

Either way, I'll set it to 3 soon and I'll report back in here. Thanks a lot for pointing me in the right direction!

@prydin (Contributor) commented Oct 17, 2018

Happy to help. You're not the only one who's confused about the documentation around this. :)

@jvigna commented Oct 25, 2018

Hi,

I have exactly the same issue on my vCenter: NO datastore metrics are collected. I use version 1.8.2 of Telegraf. Activating the debug option also gives no hint as to why no metrics are collected. I see these lines, maybe they're a hint:

2018-10-25T09:51:34Z D! [input.vsphere]: Start of sample period deemed to be 2018-10-25 09:46:34.829154 +0000 UTC
2018-10-25T09:51:34Z D! [input.vsphere]: Collecting metrics for 104 objects of type datastore for vcenter
2018-10-25T09:51:34Z D! [input.vsphere]: Querying 37 objects, 256 metrics (3 remaining) of type datastore for vcenter. Processed objects: 37. Total objects 104
2018-10-25T09:51:34Z D! [input.vsphere]: Querying 38 objects, 256 metrics (6 remaining) of type datastore for vcenter. Processed objects: 74. Total objects 104
2018-10-25T09:51:54Z D! [input.vsphere]: Query returned 0 metrics
2018-10-25T09:51:54Z D! [input.vsphere]: Query returned 0 metrics
2018-10-25T09:51:54Z D! [input.vsphere]: Query returned 0 metrics

How could I debug this better?

This is the output of the govc command:

./govc_linux_amd64 metric.sample -n 10 itasz7_01 datastore.numberReadAveraged.average

itasz7_01 - datastore.numberReadAveraged.average num
itasz7_01 naa.6000144000000010f00d71fe54eeb02a datastore.numberReadAveraged.average num
itasz7_01 - datastore.numberReadAveraged.average num
itasz7_01 naa.6000144000000010f00d71fe54eeb02a datastore.numberReadAveraged.average num
itasz7_01 - datastore.numberReadAveraged.average num
itasz7_01 naa.6000144000000010f00d71fe54eeb02a datastore.numberReadAveraged.average num
itasz7_01 - datastore.numberReadAveraged.average num
itasz7_01 naa.6000144000000010f00d71fe54eeb02a datastore.numberReadAveraged.average num

@prydin (Contributor) commented Oct 25, 2018

@jvigna It looks like govc isn't returning any metrics either. Have you tried increasing the statistics level for 5 minute samples in vCenter?

@jvigna commented Oct 25, 2018

It's strange: I've now added 2 more vCenters (smaller ones) and they work without problems. I don't think it is a setting on the vCenter side; could it be that there are too many datastores?

@jvigna commented Oct 25, 2018

BTW: primarily, I'm interested in the disk info, such as:

./govc_linux_amd64 metric.sample itasz7_01 disk.capacity.latest
itasz7_01 - disk.capacity.latest 4294705152 KB
itasz7_01 - disk.capacity.latest 4294705152 KB
itasz7_01 - disk.capacity.latest 4294705152 KB
itasz7_01 - disk.capacity.latest 4294705152 KB

And they seem to work.

@prydin (Contributor) commented Oct 25, 2018

What does your config file look like? Also, please check the stats levels on both vCenters to see if there's a difference.

@jvigna commented Oct 26, 2018

Hi, the stats levels are the same on the 3 vCenter servers, and my config is this:

[[inputs.vsphere]]
vcenters = [ "https://vcenter/sdk" ]
username = "user@domain"
password = "password"
vm_metric_include = [
"cpu.demand.average",
"cpu.idle.summation",
"cpu.latency.average",
"cpu.readiness.average",
"cpu.ready.summation",
"cpu.run.summation",
"cpu.usagemhz.average",
"cpu.used.summation",
"cpu.wait.summation",
"mem.active.average",
"mem.granted.average",
"mem.latency.average",
"mem.swapin.average",
"mem.swapinRate.average",
"mem.swapout.average",
"mem.swapoutRate.average",
"mem.usage.average",
"mem.vmmemctl.average",
"net.bytesRx.average",
"net.bytesTx.average",
"net.droppedRx.summation",
"net.droppedTx.summation",
"net.usage.average",
"power.power.average",
"virtualDisk.numberReadAveraged.average",
"virtualDisk.numberWriteAveraged.average",
"virtualDisk.read.average",
"virtualDisk.readOIO.latest",
"virtualDisk.throughput.usage.average",
"virtualDisk.totalReadLatency.average",
"virtualDisk.totalWriteLatency.average",
"virtualDisk.write.average",
"virtualDisk.writeOIO.latest",
"sys.uptime.latest",
]
host_metric_include = [
"cpu.coreUtilization.average",
"cpu.costop.summation",
"cpu.demand.average",
"cpu.idle.summation",
"cpu.latency.average",
"cpu.readiness.average",
"cpu.ready.summation",
"cpu.swapwait.summation",
"cpu.usage.average",
"cpu.usagemhz.average",
"cpu.used.summation",
"cpu.utilization.average",
"cpu.wait.summation",
"disk.deviceReadLatency.average",
"disk.deviceWriteLatency.average",
"disk.kernelReadLatency.average",
"disk.kernelWriteLatency.average",
"disk.numberReadAveraged.average",
"disk.numberWriteAveraged.average",
"disk.read.average",
"disk.totalReadLatency.average",
"disk.totalWriteLatency.average",
"disk.write.average",
"mem.active.average",
"mem.latency.average",
"mem.state.latest",
"mem.swapin.average",
"mem.swapinRate.average",
"mem.swapout.average",
"mem.swapoutRate.average",
"mem.totalCapacity.average",
"mem.usage.average",
"mem.vmmemctl.average",
"net.bytesRx.average",
"net.bytesTx.average",
"net.droppedRx.summation",
"net.droppedTx.summation",
"net.errorsRx.summation",
"net.errorsTx.summation",
"net.usage.average",
"power.power.average",
"storageAdapter.numberReadAveraged.average",
"storageAdapter.numberWriteAveraged.average",
"storageAdapter.read.average",
"storageAdapter.write.average",
"sys.uptime.latest",
]
cluster_metric_include = [] ## if omitted or empty, all metrics are collected
datastore_metric_include = [] ## if omitted or empty, all metrics are collected
datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
insecure_skip_verify = true

@prydin (Contributor) commented Oct 26, 2018

@jvigna You're collecting all metrics for datastores. That can take a long time. What's your collection interval?

I think what's happening is that the collection takes longer than the collection interval. You have two options:

  1. Reduce the number of datastore metrics you're collecting using datastore_metric_include
  2. Create two instances of the inputs.vsphere plugin: one with e.g. a 20s interval for hosts and VMs, and one with 300s for datastores and clusters (see the sketch below). Those resources only report metrics every 300s anyway, so you don't lose anything (other than making the config file slightly more complex).
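A sketch of option 2, with placeholder connection details and assuming a single vCenter (the exact include/exclude lists are up to you):

[[inputs.vsphere]]
  ## Realtime resources (hosts and VMs) on a short interval
  interval = "20s"
  vcenters = [ "https://vcenter.example.com/sdk" ]
  username = "user"
  password = "password"
  datastore_metric_exclude = [ "*" ]
  cluster_metric_exclude = [ "*" ]

[[inputs.vsphere]]
  ## Historical resources (datastores and clusters), which vCenter only reports every 300s
  interval = "300s"
  vcenters = [ "https://vcenter.example.com/sdk" ]
  username = "user"
  password = "password"
  host_metric_exclude = [ "*" ]
  vm_metric_exclude = [ "*" ]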

@jvigna commented Oct 26, 2018

I will try this, but just for the record, shouldn't I get some sort of warning if such a timeout is really hit?

BTW: my collection interval is already 30s, as with 10s I got that warning. And because of this I already have a separate instance of Telegraf just for collecting the vSphere metrics.

@prydin (Contributor) commented Oct 26, 2018

Yes, there should be errors in the logfile if this is the issue.

Can you try to set datastore_metric_include to just the metric you're interested in, e.g.

datastore_metric_include = [ "disk.capacity.latest" ]

Could you please try that and tell me what the result is and paste a logfile if it doesn't work?

@jvigna commented Oct 26, 2018

OK, I think that could be a good idea. What do I need for capacity? disk.capacity.latest and disk.used.latest? How can I get a list of the metrics?

@prydin (Contributor) commented Oct 26, 2018

Those two are good candidates. To list all metrics available, use the following govc command:

govc metric.ls tasz7_01

@jvigna commented Oct 26, 2018

As soon as I'm able to modify the configuration I'll let you know if it works when sending only a few metrics.
Thanks!

@prydin (Contributor) commented Oct 29, 2018

Keep in mind that it may take up to 30 minutes to see any data on storage capacity, since these are only generated at a 30 minute interval by vCenter.

@Compboy100 commented:

Hi @prydin, I've been using the plugin for a few days now and have also had this issue.

I've tried to create 2 instances like this:
[[inputs.vsphere]]
interval = "301s"
datastore_metric_include = []
force_discover_on_init = true

[[inputs.vsphere]]
interval = "301s"
datastore_metric_include = []
force_discover_on_init = true
datastore_metric_include = [ "disk.capacity.latest", "disk.used.latest", "disk.provisioned.latest", ]
datacenter_metric_include = []
max_query_objects = 64
max_query_metrics = 64
collect_concurrency = 3
discover_concurrency = 3
force_discover_on_init = false
object_discovery_interval = "300s"
timeout = "301s"

I finally got some data showing, but only if I choose 7 days or more. Everything below that does not show content.

@prydin (Contributor) commented Oct 29, 2018

@Compboy100 Which version of the plugin?

@Compboy100 commented:

Hi, thank you for the quick reply.

2018-10-29T14:46:15Z I! Starting Telegraf 1.8.2

Sometimes I also see this in the logs:
2018-10-29T14:47:00Z D! [input.vsphere]: Collecting metrics for 0 objects of type datastore
2018-10-29T14:48:00Z D! [input.vsphere]: Sampling period for datastore of 300 has not elapsed for host

Even though I have the interval at 301.
Like I mentioned, datastores are now at least populated above a 7-day interval.
The dashboard used is from Mr. De La Cruz: https://jorgedelacruz.uk/2018/10/01/looking-for-the-perfect-dashboard-influxdb-telegraf-and-grafana-part-xii-native-telegraf-plugin-for-vsphere/

@prydin (Contributor) commented Oct 29, 2018

I think I've tracked this down to minor clock skew between vCenter and the ESXi hosts. Working on a fix.

@prydin (Contributor) commented Nov 1, 2018

I've been working on this over the last few days and addressed multiple issues:

  1. vCenter is sometimes (very) late posting metrics, especially when running under high load. Metrics can be as much as 15 minutes delayed. We've addressed this by applying a "lookback", i.e. fetching a few sample periods back every time we query metrics. Surprisingly, this doesn't seem to have a significant performance impact and solves this issue.
  2. Data collections could time out without an error message (regression). I suspect that some of the issues reported may have been caused by that.
  3. vCenter 6.5 seems to over-estimate the size of a query for cluster metrics and reject it. Solved this by decreasing the query batch size for cluster queries.

@prydin (Contributor) commented Nov 1, 2018

Anyone who wants to be a beta tester for the fix? It's available here:

https://github.com/prydin/telegraf/releases/tag/prydin-4789

@Compboy100 commented Nov 1, 2018

Thank you, I will try it.

Can I just extract the vsphere plugin folder to my Linux distro, or do I have to compile the whole thing?

@prydin (Contributor) commented Nov 1, 2018

It's binaries. Nothing to compile.

@danielnelson (Contributor) commented:

@prydin Feel free to make a PR too; this will build all the packages on CircleCI and I can add links to the artifacts. Just note in the PR that it's still preliminary.

@prydin (Contributor) commented Nov 1, 2018

@danielnelson Will do. On the go today, but I'll get a PR filed as soon as I get back to home base.

@Compboy100 commented Nov 1, 2018

No luck getting data below 24h yet. Will report more tomorrow when back in the office.
2018-11-02T13:32:00Z D! [input.vsphere]: Starting plugin
2018-11-02T13:32:00Z D! [input.vsphere]: Running initial discovery and waiting for it to finish
2018-11-02T13:32:00Z D! [input.vsphere]: Discover new objects for
2018-11-02T13:32:00Z D! [input.vsphere] Discovering resources for host
2018-11-02T13:32:00Z D! [input.vsphere] Discovering resources for vm
2018-11-02T13:32:01Z D! [input.vsphere] Discovering resources for datastore
2018-11-02T13:32:04Z D! [input.vsphere] Discovering resources for datacenter
2018-11-02T13:32:05Z D! [input.vsphere]: Collecting metrics for 0 objects of type datastore for
2018-11-02T13:37:11Z E! Error in plugin [inputs.vsphere]: ServerFaultCode: This operation is restricted by the administrator - 'vpxd.stats.maxQueryMetrics'. Contact your system administrator

2018-11-02T13:42:07Z D! [input.vsphere] Discovering resources for datastore
2018-11-02T13:43:05Z D! [input.vsphere]: Latest: 2018-11-02 13:37:18.360235 +0000 UTC, elapsed: 364.880925, resource: datastore
2018-11-02T13:43:05Z D! [input.vsphere]: Collecting metrics for 16 objects of type datastore for
2018-11-02T13:43:05Z D! [input.vsphere]: Queuing query: 16 objects, 48 metrics (0 remaining) of type datastore for . Total objects 16 (final chunk)
2018-11-02T13:43:05Z D! [input.vsphere] Query for datastore returned metrics for 16 objects
2018-11-02T13:43:05Z D! [input.vsphere] CollectChunk for datastore returned 48 metrics
2018-11-02T13:44:05Z D! [input.vsphere]: Latest: 2018-11-02 13:43:18.24116 +0000 UTC, elapsed: 65.203200, resource: datastore
2018-11-02T13:44:05Z D! [input.vsphere]: Sampling period for datastore of 300 has not elapsed on
2018-11-02T13:45:05Z D! [input.vsphere]: Latest: 2018-11-02 13:43:18.24116 +0000 UTC, elapsed: 125.226100, resource: datastore
2018-11-02T13:45:05Z D! [input.vsphere]: Sampling period for datastore of 300 has not elapsed on

@prydin (Contributor) commented Nov 2, 2018

Ah! That one is easy to fix. I'm assuming you're running an older version of vCenter? Go ahead and set max_query_metrics to 64.

Like this:
max_query_metrics = 64

If that doesn't work, try decreasing it to 20.
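A sketch of the suggested setting in context (placeholder connection details); the idea is to keep it below the vCenter-side vpxd.stats.maxQueryMetrics limit named in the error above:

[[inputs.vsphere]]
  vcenters = [ "https://vcenter.example.com/sdk" ]
  username = "user"
  password = "password"
  ## Lower this further (e.g. to 20) if vCenter still rejects queries with the
  ## 'vpxd.stats.maxQueryMetrics' ServerFaultCode shown earlier
  max_query_metrics = 64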

@Compboy100 commented:

Yes, 6.0. I had already set it to 64.
I've now decreased it to 40, and will change it to 20 if the problem persists.

@Compboy100 commented:

Can confirm that data is being recorded for below 24h now.
Thank you @prydin

@Compboy100 commented:

@prydin Don't forget the PR. There is an RC1 live; does it include the changes?
From my side, I am getting better stats now with the test release.
There are still some minor issues:
[image]
But at least I'm getting data now.

@prydin (Contributor) commented Nov 6, 2018

I've been wanting to let this run in my lab for a while first, but I just opened a PR. @glinton and @danielnelson, is there still a chance to get this into 1.9?

@prydin (Contributor) commented Nov 6, 2018

@Compboy100 no, RC1 doesn't have the changes. I wanted to make sure it ran OK in my lab first.

@danielnelson (Contributor) commented:

It's a possibility, if not for 1.9.0 then it should be possible to get this in for 1.9.1. Let's focus on getting it added to master first, I'll review today.

@danielnelson (Contributor) commented:

Closed in #4968, @Compboy100 I'm creating a new release candidate this afternoon.
