[05:49:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:04:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:58:38] (LVSHighCPU) firing: (7) The host lvs5002:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[10:02:44] ^ huge connection spike on upload@eqsin: https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1&viewPanel=45
[10:03:38] (LVSHighCPU) resolved: (8) The host lvs5002:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[14:09:16] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10elukey)
[14:18:12] <_joe_> I'm about to merge a couple vcl changes
[14:18:17] <_joe_> just FYI
[14:19:14] ack
[14:29:53] The last Puppet run was at Wed May 18 14:14:28 UTC 2022 (15 minutes ago). Puppet is disabled. vcl change --joe
[14:29:55] sigh
[14:30:10] _joe_: ETA on that?
[14:30:24] <_joe_> vgutierrez: 45 minutes
[14:30:29] <_joe_> err 4-5
[14:30:31] <_joe_> :P
[14:30:35] ok
[14:30:49] <_joe_> I was having connection issues with eqiad
[14:30:51] <_joe_> :/
[14:35:21] <_joe_> vgutierrez: free :)
[14:35:28] thx
[15:39:39] 10Traffic, 10SRE, 10Upstream: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 (10Vgutierrez) 05Open→03Resolved `(95) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5001-5016].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4021-4...
[16:01:26] Have the certspotter systemd timer error emails from the past 12 hours been looked into?
[16:02:46] a little bit. it happens from time to time and when I looked the issue was some temp issue pulling from $I_forgot_which_one_but_one_the_cert_providers and then if you start it and try again it just works again
[16:04:06] not really new, it happened back in 2020 but only sometimes. https://gerrit.wikimedia.org/r/c/operations/puppet/+/641774
[16:04:34] https://gerrit.wikimedia.org/r/c/operations/puppet/+/428367
[16:05:14] maybe it needs to be repackaged again this time because one of them stopped working ^
[16:08:10] we currently have it running and set to sre-traffic to test it out
[16:09:57] sukhe: Are you saying that it *has* been repackaged and that these errors are coming from the newer version?
[16:14:39] yes, pretty much the behaviour mutante described, with misbehaving CT logs and such
[16:14:51] in the newer version we can filter them out so we have removed some of them
[16:15:04] in the previous version, the list of CT logs was fixed so you couldn't change it (or we didn't, either way)
[16:15:36] b8237cf, f7c4d65 and 807cc29 are the relevant commits
[16:16:21] there is still work to be done here but the current attempt was to restart the service, send it to the sre-traffic mailing list (to test it out), fix the issues (such as not reporting on certs actually issued by us) and then send it to the wider SRE team
[16:18:19] Makes sense. Thanks!
[16:22:52] so that's why I did not get those mails but you did? aha
[16:22:58] I saw some of them but from 2020
[16:23:24] where I did see it just the other day though was icinga-wm on IRC directly
[17:04:58] mutante: right now the emails (the output from the timer that runs certspotter) go to just the sre-traffic list, and hence brett and I saw that and not the wider team
[17:05:40] the idea was we will test it internally with Traffic to see the frequency of emails we get, etc. before we send it to everyone
[17:07:54] sukhe: ACK :) sounds good and explains it. I thought for a moment it's because I don't get root mail
[17:08:22] ah!
[17:11:41] and yeah, there are the icinga-wm alerts too, more recently
[17:11:56] which is part of the problem with certspotter and its reliability as of today
[17:14:02] the icinga alerts are there because we do the general "systemd status" monitoring
[17:14:16] so those resolve when you do "systemctl start certspotter" usually
[17:14:27] yeah. just that certspotter is unreliable and fails more than it should and hence the alerts :P
[17:14:28] or "systemctl reset-failed" depending if it already got restarted
[17:14:56] yea, in theory we can use an event handler to say "if this alerts then restart it"
[17:15:07] but it's not going to be worth it, probably
[17:16:07] yeah I have thought about it and it has been suggested by others too
[17:16:33] part of my hesitancy has been knowing that we need to fix certain problems in certspotter itself and I feel that if I do the "reset on failure" or something, I am ignoring the main problem :P
[17:16:45] and then actually getting to fixing it. maybe the band-aid solution actually makes sense in the interim
[17:20:23] except the band-aid solution is also not really a quick one-liner
[17:20:35] but needs to copy the whole setup for the RAID checks
[17:20:57] the band aid for the band aid is... you will hate it
[17:21:04] just restart it all the time with another time :p
[17:21:05] timer
[17:21:58] lol
[22:36:28] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqsin: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10RobH)
[22:36:43] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqsin: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10RobH) Work was completed on May 4th.
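Editor's note: the "band-aid for the band-aid" floated at 17:14–17:21 (automatically re-running certspotter after a transient CT log failure with a second timer, instead of waiting for the icinga systemd-status alert and running "systemctl start certspotter" / "systemctl reset-failed" by hand) could look roughly like the sketch below. This is a minimal illustration only, assuming certspotter runs as a Type=oneshot service named certspotter.service; the unit names, paths, and schedule are hypothetical and not what the Wikimedia puppet repo actually deploys.

    # /etc/systemd/system/certspotter-retry.service  (hypothetical unit)
    [Unit]
    Description=Clear failed state of certspotter and re-run it

    [Service]
    Type=oneshot
    # Mirror the manual recovery mentioned in the log: drop any "failed" state
    # so the icinga systemd-status check recovers, then trigger a fresh run.
    ExecStart=/bin/systemctl reset-failed certspotter.service
    ExecStart=/bin/systemctl start certspotter.service

    # /etc/systemd/system/certspotter-retry.timer  (hypothetical unit)
    [Unit]
    Description=Periodically retry certspotter after transient CT log failures

    [Timer]
    OnCalendar=hourly
    RandomizedDelaySec=10min

    [Install]
    WantedBy=timers.target

As the participants note, this only papers over certspotter's underlying reliability problems rather than fixing them, which is why it was discussed strictly as an interim band-aid.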