[05:17:09] FIRING: LVSHighRX: Excessive RX traffic on lvs1019:9100 (eno1np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1019 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[05:22:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs1019:9100 (eno1np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1019 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[07:27:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[07:37:00] RESOLVED: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[07:40:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[08:00:00] RESOLVED: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[08:04:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[08:14:00] RESOLVED: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[08:17:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[09:47:45] 06Traffic, 10Cloud-VPS (Quota-requests): Increase quota for Traffic cloud project - https://phabricator.wikimedia.org/T389196#10672930 (10aborrero) 05Open→03Resolved Sorry this took so long, the person in clinic duty overlooked it.
[11:33:10] 10netops, 06Infrastructure-Foundations, 06SRE, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10673629 (10cmooney) Things are looking good after the application of the change, an-worker nodes are correctly...
[12:41:32] 10netops, 06Infrastructure-Foundations, 06SRE, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10673820 (10BTullis) 05Open→03Resolved a:03BTullis
[13:33:23] 06Traffic, 10Citoid, 06Editing-team, 10RESTBase Sunsetting, and 3 others: Switchover plan from restbase to api gateway for Citoid - https://phabricator.wikimedia.org/T361576#10674029 (10Mvolz)
[14:11:08] 06Traffic, 13Patch-For-Review: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227#10674206 (10Fabfur)
[14:11:42] nice work fabfur!
[14:12:05] sssst! let's wait for the end of the tests :D
[14:41:45] 06Traffic: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227#10674370 (10Fabfur)
[14:45:10] 06Traffic: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227#10674383 (10Fabfur) This change has been applied to cp4047 (currently depooled and silenced due to T387238). All went fine and our checks indicates that the behavior is the expected on...
[15:17:54] 10netops, 06Infrastructure-Foundations, 06SRE: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958 (10cmooney) 03NEW p:05Triage→03Medium
[15:20:36] 10netops, 06Infrastructure-Foundations, 06SRE: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10674586 (10cmooney) @aborrero as discussed we can possibly arrange a window for Thurs Mar 27th to carry out the remaining steps? Unlike the previous attempt I will lea...
[15:44:48] \o g'day. We recently changed a whole bunch of worker IPs in Liftwing-codfw. As such, we believe we need a restart of pybal so the new IPs are picked up/put into IPVS.
[15:45:11] klausman: can you link to the change?
[15:45:28] https://phabricator.wikimedia.org/T387854 They're all part of this
[15:45:38] So it's about a dozen individual changes
[15:45:54] We reinstalled 11 machines, and 8 of them got VLAN moves
[15:46:26] 01-08 moved, 09-11 were new enough to not need that
[15:47:01] 2008 hasn't moved yet, but at this point, we only have 01 and 08 listed in ipvsadm
[15:49:16] klausman: ok. in the middle of something right now. can I ping you when done?
[15:49:22] sure
[15:51:04] 10netops, 06Infrastructure-Foundations, 06SRE: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10674734 (10cmooney) Config to be applied in first step - P74416
[16:32:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 198.35.26.112:443 @ cp4047 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=ulsfo&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[16:32:57] klausman: we can do it in 30 mins
[16:33:22] roger. By then the last remaining machine should have its new IP
[16:33:32] cool
[16:37:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 198.35.26.112:443 @ cp4047 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=ulsfo&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[17:02:37] klausman: here
[17:07:40] Let's go :)
[17:09:57] checking what needs to happen and where
[17:10:58] klausman: ok, so ml-serve hosts in codfw
[17:10:59] starting
[17:11:05] Ack
[17:11:27] There should be 2001-2011, with 2007 possibly being broken, so that isn't a showstopper
[17:11:47] it is pooled?
[17:12:14] Luca was reimaging it early and hit a speedbump, but it may have resolved. Even if it is dead-but-pooled, we can proceed.
[17:12:23] earlier*
[17:13:22] sukhe@lvs2014:~$ curl localhost:9090/pools/k8s-ingress-ml-serve_31443
[17:13:23] ml-serve2008.codfw.wmnet: enabled/up/pooled
[17:13:23] ml-serve2003.codfw.wmnet: enabled/up/pooled
[17:13:23] ml-serve2007.codfw.wmnet: enabled/down/not pooled
[17:13:23] ml-serve2005.codfw.wmnet: enabled/up/pooled
[17:13:25] ml-serve2006.codfw.wmnet: enabled/up/pooled
[17:13:27] ml-serve2004.codfw.wmnet: enabled/up/pooled
[17:13:30] ml-serve2001.codfw.wmnet: enabled/up/pooled
[17:13:32] ml-serve2002.codfw.wmnet: enabled/up/pooled
[17:13:35] looks good
[17:13:38] at least, matches up
[17:13:49] can you confirm this list is OK?
[17:13:53] yes
[17:14:01] I can't comment on the backend server bit without checking further but I can check
[17:14:03] ok
[17:14:14] I'll figure out 2007 with Luca
[17:14:35] yeah it's in the right state as far as we are concerned, so no worries
[17:15:51] klausman: all done and looks good.
[17:15:56] ty!
[17:24:07] https://config-master.wikimedia.org/pybal/codfw/inference <- 2009-11 are missing here, did I forget a step somewhere?
[17:24:54] My bad for not spotting that earlier
[17:25:08] no worries
[17:25:09] let's look
[17:25:33] sukhe@puppetserver1001:~$ sudo confctl select 'name=ml-serve2011.codfw.wmnet' get
[17:25:36] {"ml-serve2011.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=ml_serve,service=kubesvc"}
[17:25:45] host is inactive, so pool the ones you want
[17:25:51] ack
[17:26:37] and similarly 2009 and 2010, weight are 0 and marked inactive
[17:27:21] https://config-master.wikimedia.org/pybal/codfw/inference
[17:30:05] looks good I think but you be the judge!
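[editor's note] The pool check above relies on eyeballing pybal's `/pools/<service>` output, where each backend is reported as `host: enabled/up/pooled`. A minimal sketch of automating that check, assuming only the three-field format visible in the paste (the helper name and sample hosts are illustrative, not from any real tooling):

```python
# Sketch: parse pybal pool output ("host: enabled/up/pooled" per line)
# and report backends that are not actually serving traffic.
# The field layout is assumed from the curl output pasted above.
def unpooled_hosts(pool_text: str) -> list[str]:
    bad = []
    for line in pool_text.strip().splitlines():
        host, _, state = line.partition(": ")
        enabled, up, pooled = state.split("/")
        if up != "up" or pooled != "pooled":
            bad.append(host)
    return bad

sample = """\
ml-serve2001.codfw.wmnet: enabled/up/pooled
ml-serve2007.codfw.wmnet: enabled/down/not pooled
"""
print(unpooled_hosts(sample))  # ['ml-serve2007.codfw.wmnet']
```

With output like the paste above, this would flag ml-serve2007 (down/not pooled) and nothing else, matching what the operators confirmed by hand.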
[17:35:01] ack, thanks again
[17:37:55] 06Traffic, 06Data-Engineering, 06MW-Interfaces-Team: Log Api-User-Agent header in Turnilo - https://phabricator.wikimedia.org/T373871#10675270 (10Milimetric) This should've been tagged Data-Engineering, fixing
[17:41:52] 06Traffic, 06Data-Engineering, 06MW-Interfaces-Team, 07OKR-Work: Log Api-User-Agent header in Turnilo - https://phabricator.wikimedia.org/T373871#10675291 (10Ottomata) This should fit into work for [[ https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2025-2026/Product_%26_Technology_OKRs#Res...
[17:58:44] 06Traffic: varnishkafka 1.1.0-5 exits on SIGHUP - https://phabricator.wikimedia.org/T389978#10675425 (10BCornwall)
[17:58:56] 06Traffic: varnishkafka 1.1.0-5 exits on SIGHUP - https://phabricator.wikimedia.org/T389978#10675426 (10BCornwall) 05Open→03In progress p:05Triage→03High
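[editor's note] The T389978 ticket above reports varnishkafka 1.1.0-5 exiting on SIGHUP; the actual fix is not shown in this log. The common daemon pattern for this class of bug is to install a SIGHUP handler so the signal (typically sent by logrotate or a service reload) means "reopen/reload" rather than the default action of terminating. A minimal sketch of that pattern in Python, purely for illustration (varnishkafka itself is C, and this is not its code):

```python
import os
import signal

# By default SIGHUP terminates the process. Installing a handler turns
# it into a "reload config / reopen logs" notification instead.
reload_requested = False

def on_sighup(signum, frame):
    global reload_requested
    reload_requested = True  # a real daemon would reopen logs/config here

signal.signal(signal.SIGHUP, on_sighup)

# Simulate the signal logrotate or a service reload would deliver:
os.kill(os.getpid(), signal.SIGHUP)
print(reload_requested)  # True -- the process handled SIGHUP and survived
```

The same shape applies in C via `sigaction()` with a handler that sets a `volatile sig_atomic_t` flag checked from the main loop.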