[09:29:28] morning
[09:50:19] dhinus: it seems it stabilized a bit more, I had a quick look at the last logs for one of the latest failures (today at ~00:54 UTC) and I can see the gap in the apache logs, so probably the request did not get to apache (that points to network, either the system being overloaded and not able to handle the request, or the request not reaching it)
[09:50:43] https://www.irccloud.com/pastebin/b1PfCbrk/
[09:51:13] there's a gap between 00:53 and 00:55, and 00:55 and 00:57 (it usually gets >1 request/min)
[09:52:47] sssd is flapping on that machine, though it flaps regularly during the day, so probably unrelated:
[09:52:47] `Jan 20 00:59:50 tools-legacy-redirector-2 sssd_nss[348819]: Shutting down (status = 0)`
[09:52:47] `Jan 20 00:59:50 tools-legacy-redirector-2 systemd[1]: sssd-nss.service: Deactivated successfully.`
[09:54:33] dcaro: thanks. I thought of network too, but the increase in the load graph seems suspicious
[09:55:11] so my next thought was something on the hypervisor level, that's why I tried migrating it
[09:55:41] do you know the right way to migrate it? I tried following https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Maintenance#Live_Migrating_Virtual_Machines but that didn't work
[09:56:29] did you 'source novaenv.sh'?
[09:56:39] one thing I did not check is if there was any spike in requests
[09:57:23] by hand I did not notice any extra spike, on that vhost at least (it actually has a gap of no requests for a minute)
[09:57:51] but it may be another vhost that was getting the load
[09:57:52] dcaro: nope, maybe I just missed that! I was also unsure whether I should run the command from cloudcontrol or the cloudvirt itself
[09:59:02] yep, from my bash history on cloudcontrol I think I stupidly didn't realize I needed wmcs-openstack or "source novaenv" :facepalm:
[09:59:37] xd
[10:21:20] there is a disk space alert on cloudcontrol1005, it's down to a 15G log file /var/log/apache2/other_vhosts_access-json.log.1
[10:21:39] is someone hammering the openstack api?
[10:31:15] found this API dashboard but it's broken :/ https://grafana-rw.wikimedia.org/d/tanisM2Zz/wmcs-openstack-api-stats-eqiad1
[10:45:42] hmm... I thought the dashboards had been fixed not long ago :/
[10:45:59] I'm looking into it, I think it's looking at the wrong data source
[10:46:11] I can find the metrics if I select "eqiad prometheus/labs"
[10:47:52] hmm, not actually true, or not for all graphs
[10:48:04] what about thanos?
[10:48:37] I can find something from "explore" but not in the dashboard itself
[10:49:01] I think that it might have changed with some upgrade of haproxy
[10:49:19] this old task might be related https://phabricator.wikimedia.org/T343885
[10:49:25] probably the metric names changed
[10:54:12] yep, I think that for the backend status it's now haproxy_backend_status (it was *_up)
[10:54:49] I'll let you edit it :)
[10:56:41] the task seems to imply the old metrics should still be available
[10:56:47] but maybe they were removed at some point?
[10:57:21] probably in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1076297
[10:59:45] yep xd "might break some dashboards or similar (did not check)"
[11:09:44] it's quite a lot of work because the variables must be changed too
[11:09:50] I'll open a subtask
[11:10:05] do we have other dashboards where I can check if there was a spike in API requests?
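(Editor's note on the metric rename discussed above: one quick way to check whether the old haproxy metric is gone and the new one is what the dashboard panels should query is to list the metric names Prometheus currently knows about. This is only a sketch; the Prometheus base URL is an assumed placeholder, not taken from the chat.)

```
# Sketch: confirm whether haproxy_backend_up has disappeared and
# haproxy_backend_status is what the dashboard queries should use now.
# PROM is an assumed placeholder for the relevant Prometheus base URL.
PROM='http://prometheus.example.internal:9090'
curl -s "${PROM}/api/v1/label/__name__/values" \
  | jq -r '.data[]' \
  | grep -E '^haproxy_backend_(up|status)'
```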
[11:12:49] maybe https://grafana.wikimedia.org/d/UUmLqqX4k/wmcs-openstack-api-latency?orgId=1&refresh=30s
[11:15:16] yes, I wonder if it includes all types of requests, or if some are missing
[11:15:28] three easy reviews: https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests?label_name%5B%5D=Needs+review
[11:15:48] I think it should include everything there, it's using the same haproxy requests
[11:16:11] well I hacked one query removing "proxy=" so that it shows the sum of all proxies
[11:16:18] and I don't see any spike today
[11:16:47] and friday?
[11:16:55] (the flakiness was way worse then)
[11:17:26] I was trying to explain the disk space alert (huge apache log file), not the flakiness :)
[11:18:11] oh, true xd, then yep
[11:18:43] I can try doing some grepping on the log file #oldschoolobservability :)
[11:19:17] it might also be in logstash, might give a hint on the most common fields
[11:19:23] true
[11:19:41] * dcaro lunch
[11:20:13] I want to have a quick look at the alerts, as I see one in the prometheus-alerts.wmcloud.o that's not in alerts.w.o
[11:21:26] which one?
[11:21:55] PuppetCertificateAboutToExpire?
[12:21:23] yep, that one
[12:35:57] ohhh, I think that the haproxy is balancing on both alertmanagers, and one of them returns stuff, the other does not
[12:53:21] hmm, the prometheus is unable to reach the alertmanager-3, tcp issues, looking at security groups
[12:53:25] https://www.irccloud.com/pastebin/icL2grs9/
[12:58:48] weird, the security groups look ok :/
[12:58:58] I'll try stopping, then starting the instance
[13:02:50] that seemed to do the trick
[18:21:16] * dcaro off
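(Editor's note on the tcp issue at 12:53, confirmed in the paste linked there: a reachability check of that kind could look like the sketch below. The bare hostname `alertmanager-3` is the short name used in the chat, and 9093 is only the Alertmanager default port; the actual FQDN and port are assumptions.)

```
# Sketch of a reachability check from the prometheus host towards the
# alertmanager that was timing out. Hostname and port are assumptions.
nc -vz -w 5 alertmanager-3 9093
# If the TCP connect succeeds, the Alertmanager health endpoint should answer:
curl -sS --max-time 5 http://alertmanager-3:9093/-/healthy
```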