[01:01:49] 10Acme-chief, 10Patch-Needs-Improvement: Check for expired/outdated certs in the main loop - https://phabricator.wikimedia.org/T207374 (10Pppery)
[08:12:12] 10Traffic, 10SRE, 10Upstream: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796 (10Vgutierrez) cp4044 was missing the haproxy 2.6.9 in /var/cache/apt/archives and 2.6.11 has been released, so I'm testing it there right now
[10:54:32] 10Domains, 10Traffic, 10SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10Aklapper) > I suspect it's used in *some* channels to refer to the *English* Wikipedia, but people could just...stop doing that? I don't see a good reason to potentially end up with lots of `LANGUAGECODEwp.TLD` sty...
[10:57:45] (HAProxyRestarted) firing: (8) HAProxy server restarted on cp4037:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[11:01:17] the new alert works as expected :)
[11:02:41] \o/ \o/
[11:04:04] dashboard link is wrong though
[11:04:36] site variable expects "$DC prometheus/ops" rather than "$DC"
[11:10:23] https://gerrit.wikimedia.org/r/c/operations/alerts/+/902319/ quick fix :)
[11:11:00] CI doesn't agree of course
[11:11:47] can't escape the stern look of jenkins
[11:13:11] yeah... %20 isn't kosher apparently
[11:13:39] but is currently being used in some dashboards already
[11:13:40] :/
[11:48:15] (HAProxyRestarted) resolved: (2) HAProxy server restarted on cp4041:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[11:58:03] 10Traffic, 10SRE, 10Patch-For-Review, 10Upstream: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796 (10Vgutierrez) p:05High→03Medium https://github.com/haproxy/haproxy/commit/407210a34d781f8249504557c371c170cb34f93e introduced in HAProxy 2.6.10 has been identified...
[12:47:57] 10Traffic, 10SRE, 10Upstream: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796 (10Vgutierrez) 05Open→03Resolved alerting is in place to avoid this kind of issue upon HAProxy updates in the future and everything is down to 2.6.9: ` vgutierrez@cumin1001:~$ sudo -i cum...
[13:55:34] 10Traffic, 10SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host pybal-test2003.codfw.wmnet with OS bullseye
[14:21:45] 10Traffic, 10SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host pybal-test2003.codfw.wmnet with OS bullseye completed: - pybal-test2003 (**PASS**) - Downtimed on Icinga/Alertmanage...
[14:31:59] 10Traffic, 10SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) Reimaged `pybal-test2003` to bullseye, added `component/pybal` and everything appears to be fine with the installation. ` pybal: Installed: 1.15.10+deb11u1 Candidate: 1.15.10+deb11u1 Version table:...
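[Editor's note: the first task in the log, T207374, concerns acme-chief checking from its main loop whether managed certificates are expired or about to expire. As a rough illustration of that kind of check only (not acme-chief's actual code, and the 7-day margin is an arbitrary choice), openssl can answer the same question for a single certificate file:

    # exit status is non-zero if cert.pem expires within the next 7 days (604800 s)
    openssl x509 -checkend 604800 -noout -in cert.pem || echo "renewal needed"

A periodic loop would run an equivalent check per certificate and trigger reissuance when it fails.]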
[15:19:30] 10Traffic, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10colewhite)
[15:20:51] 10Traffic, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10colewhite)
[15:43:57] (PurgedHighEventLag) firing: High event process lag with purged on cp5022:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5022 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:45:16] hmm latency to kafka@eqiad spiking over 50 seconds
[15:46:10] hmm
[15:47:49] that's impacting eqsin, not just 5022
[15:48:08] both text & upload
[15:48:57] (PurgedHighEventLag) firing: (8) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:49:20] 8 packets transmitted, 7 received, 12.5% packet loss, time 7007ms
[15:49:29] 4 packets transmitted, 3 received, 25% packet loss, time 3000ms
[15:49:30] not great :)
[15:50:24] https://grafana.wikimedia.org/goto/6gMrOOfVk?orgId=1
[15:51:08] XioNoX, topranks sorry to disrupt your sprint week ^^
[15:51:21] I'm not seeing anything on the maintenance calendar
[15:51:58] vgutierrez: looking
[15:53:12] something happened around 15:30
[15:53:55] and eqsin is hurting... p75 on text for caches miss|pass https://grafana.wikimedia.org/goto/hACzFdfVk?orgId=1
[15:53:57] (PurgedHighEventLag) resolved: (5) High event process lag with purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:55:42] vgutierrez: on the face of it not seeing issues across the WAN
[15:55:48] https://phabricator.wikimedia.org/P45933
[15:55:55] I'll open a task and do some more checks
[15:57:24] it seems to be recovering
[15:57:35] just 30 minutes of degraded service
[15:57:38] :_)
[15:57:52] hmm
[15:58:04] where did you run those pings you posted the results from above?
[15:59:05] https://www.irccloud.com/pastebin/ECLHEdpy/
[15:59:21] that's the raw data
[16:00:49] thanks
[16:01:38] kafka latency is back to normal too
[16:01:50] hmm yeah it must have been something with one of the transport links
[16:03:31] we have seen these in the past too though I guess for shorter durations
[16:03:44] big spike in usage over the main one that's in use there I notice
[16:03:45] https://librenms.wikimedia.org/graphs/to=1679587200/id=13968/type=port_bits/from=1679500800/
[16:05:43] we didn't have a spike in RPS
[16:06:15] yeah
[16:06:29] no jump in usage on the internet circuits there either, so doesn't look like externally triggered DOS event
[16:07:25] topranks: traffic spike though in upload: https://grafana.wikimedia.org/goto/BL3_FdBVk?orgId=1
[16:07:46] usage on the transport link from eqsin to codfw is still very high
[16:08:17] visible here too: https://grafana.wikimedia.org/goto/vFXrFOB4z?orgId=1
[16:08:37] topranks: that's ATS going to the appservers
[16:08:55] yep, that would seem to be it alright, and still climbing :(
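[Editor's note: for loss like the ping figures above, a per-hop view of the path helps tell a congested transport link from an end-host problem. A generic sketch only; the target host is illustrative, not what was actually run during this incident:

    # 100 probes, summarised per hop; loss that appears at one hop and persists
    # downstream points at that link rather than at the destination
    mtr --report --report-cycles 100 kafka-main1001.eqiad.wmnet

This complements the LibreNMS port graph linked above, which shows utilisation on the transport circuit itself.]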
[16:37:22] <_joe_> vgutierrez, sukhe: is any of you available to merge and deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/901333 ?
[16:37:37] <_joe_> It seems like an important bugfix
[16:38:48] I'm wondering why an ATS restart is required though
[16:38:55] _joe_: happy to take care of it
[16:39:00] if vgutierrez is fine with it :)
[16:39:04] sure
[16:39:21] <_joe_> vgutierrez: we are changing the lua code, not sure if a restart is needed
[16:39:35] it shouldn't
[16:39:49] unless you wanna be super sure that no thread with the old code stays alive
[16:40:16] considering OAuth is involved better safe than sorry
[16:40:57] there is a comment from tstarling below that says restart as well fwiw (doesn't mention the context)
[16:41:11] that's why I was wondering about the restarts :)
[20:01:27] 10netops, 10Infrastructure-Foundations, 10SRE, 10fundraising-tech-ops: Please update bootp helper on pfw3-eqiad to point to frpm1002 for fundraising subnets - https://phabricator.wikimedia.org/T332939 (10Dwisehaupt)
[23:51:44] (VarnishHighThreadCount) firing: (3) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[23:56:44] (VarnishHighThreadCount) firing: (5) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
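[Editor's note: on the restart-versus-reload question for the ATS Lua change above, a hedged sketch of the two options as typically issued on a Traffic Server host; the exact systemd unit name and whether a plain reload re-reads remapped Lua scripts depend on the local setup, so this is illustrative only:

    # lighter: re-reads configuration without dropping connections
    sudo traffic_ctl config reload
    # heavier, but matches the "better safe than sorry" reasoning above:
    # guarantees no thread keeps executing the old Lua code
    sudo systemctl restart trafficserver]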