[01:01:49] 10Acme-chief, 10Patch-Needs-Improvement: Check for expired/outdated certs in the main loop - https://phabricator.wikimedia.org/T207374 (10Pppery)
[08:12:12] 10Traffic, 10SRE, 10Upstream: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796 (10Vgutierrez) cp4044 was missing the haproxy 2.6.9 in /var/cache/apt/archives and 2.6.11 has been released, so I'm testing it there right now
[10:54:32] 10Domains, 10Traffic, 10SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10Aklapper) > I suspect it's used in *some* channels to refer to the *English* Wikipedia, but people could just...stop doing that? I don't see a good reason to potentially end up with lots of `LANGUAGECODEwp.TLD` sty...
[10:57:45] (HAProxyRestarted) firing: (8) HAProxy server restarted on cp4037:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[11:01:17] the new alert works as expected :)
[11:02:41] \o/ \o/
[11:04:04] dashboard link is wrong though
[11:04:36] site variable expects "$DC prometheus/ops" rather than "$DC"
[11:10:23] https://gerrit.wikimedia.org/r/c/operations/alerts/+/902319/ quick fix :)
[11:11:00] CI doesn't agree of course
[11:11:47] can't escape the stern look of jenkins
[11:13:11] yeah... %20 isn't kosher apparently
[11:13:39] but is currently being used in some dashboards already
[11:13:40] :/
[11:48:15] (HAProxyRestarted) resolved: (2) HAProxy server restarted on cp4041:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[11:58:03] 10Traffic, 10SRE, 10Patch-For-Review, 10Upstream: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796 (10Vgutierrez) p:05High→03Medium https://github.com/haproxy/haproxy/commit/407210a34d781f8249504557c371c170cb34f93e introduced in HAProxy 2.6.10 has been identified...
[12:47:57] 10Traffic, 10SRE, 10Upstream: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796 (10Vgutierrez) 05Open→03Resolved alerting is in place to avoid this kind of issue upon HAProxy updates in the future and everything is down to 2.6.9: ` vgutierrez@cumin1001:~$ sudo -i cum...
[13:55:34] 10Traffic, 10SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host pybal-test2003.codfw.wmnet with OS bullseye
[14:21:45] 10Traffic, 10SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host pybal-test2003.codfw.wmnet with OS bullseye completed: - pybal-test2003 (**PASS**) - Downtimed on Icinga/Alertmanage...
[14:31:59] 10Traffic, 10SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) Reimaged `pybal-test2003` to bullseye, added `component/pybal` and everything appears to be fine with the installation. ` pybal: Installed: 1.15.10+deb11u1 Candidate: 1.15.10+deb11u1 Version table:...
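[Editor's note: the first task in the log, T207374, concerns acme-chief checking from its main loop whether managed certificates are expired or about to expire. As a rough illustration of that kind of check only (not acme-chief's actual code, and the 7-day margin is an arbitrary choice), openssl can answer the same question for a single certificate file:

    # exit status is non-zero if cert.pem expires within the next 7 days (604800 s)
    openssl x509 -checkend 604800 -noout -in cert.pem || echo "renewal needed"

A periodic loop would run an equivalent check per certificate and trigger reissuance when it fails.]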
[15:19:30] 10Traffic, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10colewhite)
[15:20:51] 10Traffic, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10colewhite)
[15:43:57] (PurgedHighEventLag) firing: High event process lag with purged on cp5022:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5022 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:45:16] hmm latency to kafka@eqiad spiking over 50 seconds
[15:46:10] hmm
[15:47:49] that's impacting eqsin, not just 5022
[15:48:08] both text & upload
[15:48:57] (PurgedHighEventLag) firing: (8) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:49:20] 8 packets transmitted, 7 received, 12.5% packet loss, time 7007ms
[15:49:29] 4 packets transmitted, 3 received, 25% packet loss, time 3000ms
[15:49:30] not great :)
[15:50:24] https://grafana.wikimedia.org/goto/6gMrOOfVk?orgId=1
[15:51:08] XioNoX, topranks sorry to disrupt your sprint week ^^
[15:51:21] I'm not seeing anything on the maintenance calendar
[15:51:58] vgutierrez: looking
[15:53:12] something happened around 15:30
[15:53:55] and eqsin is hurting... p75 on text for caches miss|pass https://grafana.wikimedia.org/goto/hACzFdfVk?orgId=1
[15:53:57] (PurgedHighEventLag) resolved: (5) High event process lag with purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:55:42] vgutierrez: on the face of it not seeing issues across the WAN
[15:55:48] https://phabricator.wikimedia.org/P45933
[15:55:55] I'll open a task and do some more checks
[15:57:24] it seems to be recovering
[15:57:35] just 30 minutes of degraded service
[15:57:38] :_)
[15:57:52] hmm
[15:58:04] where did you run those pings you posted the results from above?
[15:59:05] https://www.irccloud.com/pastebin/ECLHEdpy/
[15:59:21] that's the raw data
[16:00:49] thanks
[16:01:38] kafka latency is back to normal too
[16:01:50] hmm yeah it must have been something with one of the transport links
[16:03:31] we have seen these in the past too though I guess for shorter durations
[16:03:44] big spike in usage over the main one that's in use there I notice
[16:03:45] https://librenms.wikimedia.org/graphs/to=1679587200/id=13968/type=port_bits/from=1679500800/
[16:05:43] we didn't have a spike in RPS
[16:06:15] yeah
[16:06:29] no jump in usage on the internet circuits there either, so doesn't look like externally triggered DOS event
[16:07:25] topranks: traffic spike though in upload: https://grafana.wikimedia.org/goto/BL3_FdBVk?orgId=1
[16:07:46] usage on the transport link from eqsin to codfw is still very high
[16:08:17] visible here too: https://grafana.wikimedia.org/goto/vFXrFOB4z?orgId=1
[16:08:37] topranks: that's ATS going to the appservers
[16:08:55] yep, that would seem to be it alright, and still climbing :(
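[Editor's note: for loss like the ping figures above, a per-hop view of the path helps tell a congested transport link from an end-host problem. A generic sketch only; the target host is illustrative, not what was actually run during this incident:

    # 100 probes, summarised per hop; loss that appears at one hop and persists
    # downstream points at that link rather than at the destination
    mtr --report --report-cycles 100 kafka-main1001.eqiad.wmnet

This complements the LibreNMS port graph linked above, which shows utilisation on the transport circuit itself.]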
[16:37:22] <_joe_> vgutierrez, sukhe: is any of you available to merge and deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/901333 ?
[16:37:37] <_joe_> It seems like an important bugfix
[16:38:48] I'm wondering why an ATS restart is required though
[16:38:55] _joe_: happy to take care of it
[16:39:00] if vgutierrez is fine with it :)
[16:39:04] sure
[16:39:21] <_joe_> vgutierrez: we are changing the lua code, not sure if a restart is needed
[16:39:35] it shouldn't
[16:39:49] unless you wanna be super sure that no thread with the old code stays alive
[16:40:16] considering OAuth is involved better safe than sorry
[16:40:57] there is a comment from tstarling below that says restart as well fwiw (doesn't mention the context)
[16:41:11] that's why I was wondering about the restarts :)
[20:01:27] 10netops, 10Infrastructure-Foundations, 10SRE, 10fundraising-tech-ops: Please update bootp helper on pfw3-eqiad to point to frpm1002 for fundraising subnets - https://phabricator.wikimedia.org/T332939 (10Dwisehaupt)
[23:51:44] (VarnishHighThreadCount) firing: (3) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[23:56:44] (VarnishHighThreadCount) firing: (5) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
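[Editor's note: on the restart-versus-reload question for the ATS Lua change above, a hedged sketch of the two options as typically issued on a Traffic Server host; the exact systemd unit name and whether a plain reload re-reads remapped Lua scripts depend on the local setup, so this is illustrative only:

    # lighter: re-reads configuration without dropping connections
    sudo traffic_ctl config reload
    # heavier, but matches the "better safe than sorry" reasoning above:
    # guarantees no thread keeps executing the old Lua code
    sudo systemctl restart trafficserver]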