[06:07:37] !log tools.masto-collab Updated from 0b1e1a7 to ae62c97
[06:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.masto-collab/SAL
[10:29:54] taavi: FYI https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/41
[10:53:34] godog:
[10:54:47] I just opened T335943 and would appreciate your quick eyeball on it for any quick hints as to where to start looking :-P
[10:54:48] T335943: prometheus-openstack-exporter: collected data shows regular null intervals - https://phabricator.wikimedia.org/T335943
[10:59:06] arturo: did you check the prometheus logs on cloudmetrics? if it's timing out on the scrape it should show up there I think
[10:59:15] arturo: yeah I think you got it right, also what dcaro said
[10:59:34] IIRC the default scrape interval is one minute, and prometheus gives up if the scrape takes longer than that I think
[11:00:25] accessing the web interface via ssh tunnel will show you the errors too, if any
[11:01:39] I can't seem to find any relevant logs
[11:02:59] from the default prometheus config it seems the timeout is 10s xd
[11:03:01] https://www.irccloud.com/pastebin/2upSFTBK/
[11:03:40] ok, trying the web console. By default scrape logs are not active on the journal/system side apparently
[11:04:31] ok
[11:05:35] dcaro: interesting, yeah I can't recall if that's the timeout e.g. to wait for an answer, or whether the reply is there and it just takes a long time to produce the metrics
[11:05:52] at any rate >60s for metrics is definitely suspicious given the scrape interval
[11:06:58] we seem to set it to 120s for openstack, and scrape every 15m
[11:07:16] https://www.irccloud.com/pastebin/R2epJR87/
[11:07:58] interesting
[11:08:10] I need to go to lunch now, will read later and take another look
[11:08:22] thanks!
[11:26:46] also the scrape is configured for http://openstack.eqiad1.wikimediacloud.org:12345/metrics
[11:34:08] oh, so if the scrape is every 15m the gap in the data may be the scrape interval itself
[11:36:05] ok, now discovering this https://gerrit.wikimedia.org/r/c/operations/puppet/+/802434
[11:55:05] that rings a bell yes, iirc openstack got (or was getting) too overloaded and unstable because of that exporter
[11:55:50] https://gerrit.wikimedia.org/r/c/operations/puppet/+/802956 the revert
[11:56:15] as far as I remember the revert helped, maybe a.ndrewbogott remembers more
[14:03:16] arturo: did the revert of the revert help?
[14:10:47] godog: let me check
[14:28:18] looks like it!
[14:28:28] it == it helped
[14:36:37] yes!
[14:38:04] \o/
[14:39:41] godog: thanks for the assistance, I'll keep an eye on openstack to see what the impact of the new scrape interval is
[14:40:20] sure np! yeah hopefully it isn't too bad/expensive
[15:11:20] !log metricsinfra rebooting metricsinfra-prometheus-2 as it was unresponsive
[15:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Metricsinfra/SAL
[22:49:27] !log removed fullstack-* puppet reports on puppetmaster-02.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud and cloud-puppetmaster-03.cloudinfra.eqiad.wmflabs to free up disk space
[22:49:28] andrewbogott: Unknown project "removed"
[22:49:40] !log admin removed fullstack-* puppet reports on puppetmaster-02.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud and cloud-puppetmaster-03.cloudinfra.eqiad.wmflabs to free up disk space
[22:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
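
A minimal sketch of how the remaining gaps from T335943 could be cross-checked against the 15m scrape interval discussed above. It assumes Prometheus is reached on localhost:9090 (for example through the ssh tunnel mentioned around 11:00) and uses a hypothetical `up{job="openstack"}` selector for the exporter job; neither is taken from the actual cloudmetrics setup. Querying a range vector with an instant query returns raw samples at their real scrape timestamps (no lookback interpolation), so any gap well beyond 15 minutes means a scrape was genuinely missed rather than just being the expected spacing between samples:

```python
#!/usr/bin/env python3
# Sketch: pull the raw samples for a metric over the last 24h and report any
# gap between consecutive scrapes that is longer than the configured interval.
import time

import requests

PROM_URL = "http://localhost:9090"   # assumption: Prometheus reached via ssh tunnel
METRIC = 'up{job="openstack"}'       # hypothetical selector for the exporter job
SCRAPE_INTERVAL = 15 * 60            # 15m, as mentioned in the discussion above

# An instant query over a range vector returns the raw samples with their real
# scrape timestamps, which is what we want for gap detection.
resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": f"{METRIC}[24h]", "time": time.time()},
    timeout=30,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    timestamps = [ts for ts, _ in series["values"]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        # Allow some slack: only flag a gap if a whole scrape went missing.
        if gap > SCRAPE_INTERVAL * 1.5:
            print(f"{series['metric']}: missing scrape, {gap:.0f}s gap ending at {cur}")
```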