[05:49:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:04:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:58:38] (LVSHighCPU) firing: (7) The host lvs5002:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[10:02:44] ^ huge connection spike on upload@eqsin: https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1&viewPanel=45
[10:03:38] (LVSHighCPU) resolved: (8) The host lvs5002:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[14:09:16] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10elukey)
[14:18:12] <_joe_> I'm about to merge a couple vcl changes
[14:18:17] <_joe_> just FYI
[14:19:14] ack
[14:29:53] The last Puppet run was at Wed May 18 14:14:28 UTC 2022 (15 minutes ago). Puppet is disabled. vcl change --joe
[14:29:55] sigh
[14:30:10] _joe_: ETA on that?
[14:30:24] <_joe_> vgutierrez: 45 minutes
[14:30:29] <_joe_> err 4-5
[14:30:31] <_joe_> :P
[14:30:35] ok
[14:30:49] <_joe_> I was having connection issues with eqiad
[14:30:51] <_joe_> :/
[14:35:21] <_joe_> vgutierrez: free :)
[14:35:28] thx
[15:39:39] 10Traffic, 10SRE, 10Upstream: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 (10Vgutierrez) 05Open→03Resolved `(95) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5001-5016].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4021-4...
[16:01:26] Have the certspotter systemd timer error emails from the past 12 hours been looked into?
[16:02:46] a little bit. it happens from time to time and when I looked the issue was some temp issue pulling from $I_forgot_which_one_but_one_the_cert_providers and then if you start it and try again it just works again
[16:04:06] not really new, it happened back in 2020 but only sometimes. https://gerrit.wikimedia.org/r/c/operations/puppet/+/641774
[16:04:34] https://gerrit.wikimedia.org/r/c/operations/puppet/+/428367
[16:05:14] maybe it needs to be repackaged again this time because one of them stopped working ^
[16:08:10] we currently have it running and set to sre-traffic to test it out
[16:09:57] sukhe: Are you saying that it *has* been repackaged and that these errors are coming from the newer version?
[16:14:39] yes, pretty much the behaviour mutante described, with misbehaving CT logs and such
[16:14:51] in the newer version we can filter them out so we have removed some of them
[16:15:04] in the previous version, the list of CT logs was fixed so you couldn't change it (or we didn't, either way)
[16:15:36] b8237cf, f7c4d65 and 807cc29 are the relevant commits
[16:16:21] there is still work to be done here but the current attempt was to restart the service, send it to the sre-traffic mailing list (to test it out), fix the issues (such as not reporting on certs actually issued by us) and then send it to the wider SRE team
[16:18:19] Makes sense. Thanks!
[16:22:52] so that's why I did not get those mails but you did? aha
[16:22:58] I saw some of them but from 2020
[16:23:24] where I did see it just the other day though was icinga-wm on IRC directly
[17:04:58] mutante: right now the emails (the output from the timer that runs certspotter) go to just the sre-traffic list, and hence brett and I saw that and not the wider team
[17:05:40] the idea was we will test it internally with Traffic to see the frequency of emails we get, etc. before we send it to everyone
[17:07:54] sukhe: ACK :) sounds good and explains it. I thought for a moment it's because I don't get root mail
[17:08:22] ah!
[17:11:41] and yeah, there are the icinga-wm alerts too, more recently
[17:11:56] which is part of the problem with certspotter and its reliability as of today
[17:14:02] the icinga alerts are there because we do the general "systemd status" monitoring
[17:14:16] so those resolve when you do "systemctl start certspotter" usually
[17:14:27] yeah. just that certspotter is unreliable and fails more than it should and hence the alerts :P
[17:14:28] or "systemctl reset-failed" depending if it already got restarted
[17:14:56] yea, in theory we can use an event handler to say "if this alerts then restart it"
[17:15:07] but it's not going to be worth it, probably
[17:16:07] yeah I have thought about it and it has been suggested by others too
[17:16:33] part of my hesitancy has been knowing that we need to fix certain problems in certspotter itself and I feel that if I do the "reset on failure" or something, I am ignoring the main problem :P
[17:16:45] and then actually getting to fixing it. maybe the band-aid solution actually makes sense in the interim
[17:20:23] except the band-aid solution is also not really a quick one-liner
[17:20:35] but needs to copy the whole setup for the RAID checks
[17:20:57] the band aid for the band aid is... you will hate it
[17:21:04] just restart it all the time with another time :p
[17:21:05] timer
[17:21:58] lol
[22:36:28] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqsin: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10RobH)
[22:36:43] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqsin: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10RobH) Work was completed on May 4th.
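Editor's note: the "band-aid for the band-aid" floated at 17:14–17:21 (automatically re-running certspotter after a transient CT log failure with a second timer, instead of waiting for the icinga systemd-status alert and running "systemctl start certspotter" / "systemctl reset-failed" by hand) could look roughly like the sketch below. This is a minimal illustration only, assuming certspotter runs as a Type=oneshot service named certspotter.service; the unit names, paths, and schedule are hypothetical and not what the Wikimedia puppet repo actually deploys.

    # /etc/systemd/system/certspotter-retry.service  (hypothetical unit)
    [Unit]
    Description=Clear failed state of certspotter and re-run it

    [Service]
    Type=oneshot
    # Mirror the manual recovery mentioned in the log: drop any "failed" state
    # so the icinga systemd-status check recovers, then trigger a fresh run.
    ExecStart=/bin/systemctl reset-failed certspotter.service
    ExecStart=/bin/systemctl start certspotter.service

    # /etc/systemd/system/certspotter-retry.timer  (hypothetical unit)
    [Unit]
    Description=Periodically retry certspotter after transient CT log failures

    [Timer]
    OnCalendar=hourly
    RandomizedDelaySec=10min

    [Install]
    WantedBy=timers.target

As the participants note, this only papers over certspotter's underlying reliability problems rather than fixing them, which is why it was discussed strictly as an interim band-aid.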