[05:17:09] FIRING: LVSHighRX: Excessive RX traffic on lvs1019:9100 (eno1np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1019 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[05:22:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs1019:9100 (eno1np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1019 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[07:27:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[07:37:00] RESOLVED: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[07:40:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[08:00:00] RESOLVED: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[08:04:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[08:14:00] RESOLVED: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[08:17:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[09:47:45] 06Traffic, 10Cloud-VPS (Quota-requests): Increase quota for Traffic cloud project - https://phabricator.wikimedia.org/T389196#10672930 (10aborrero) 05Open→03Resolved Sorry this took so long, the person in clinic duty overlooked it.
[11:33:10] 10netops, 06Infrastructure-Foundations, 06SRE, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10673629 (10cmooney) Things are looking good after the application of the change, an-worker nodes are correctly...
[12:41:32] 10netops, 06Infrastructure-Foundations, 06SRE, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10673820 (10BTullis) 05Open→03Resolved a:03BTullis
[13:33:23] 06Traffic, 10Citoid, 06Editing-team, 10RESTBase Sunsetting, and 3 others: Switchover plan from restbase to api gateway for Citoid - https://phabricator.wikimedia.org/T361576#10674029 (10Mvolz)
[14:11:08] 06Traffic, 13Patch-For-Review: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227#10674206 (10Fabfur)
[14:11:42] nice work fabfur!
[14:12:05] sssst! let's wait for the end of the tests :D
[14:41:45] 06Traffic: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227#10674370 (10Fabfur)
[14:45:10] 06Traffic: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227#10674383 (10Fabfur) This change has been applied to cp4047 (currently depooled and silenced due to T387238). All went fine and our checks indicates that the behavior is the expected on...
[15:17:54] 10netops, 06Infrastructure-Foundations, 06SRE: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958 (10cmooney) 03NEW p:05Triage→03Medium
[15:20:36] 10netops, 06Infrastructure-Foundations, 06SRE: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10674586 (10cmooney) @aborrero as discussed we can possibly arrange a window for Thurs Mar 27th to carry out the remaining steps? Unlike the previous attempt I will lea...
[15:44:48] \o g'day. We recently changed a whole bunch of worker IPs in Liftwing-codfw. As such, we believe we need a restart of pybal so the new IPs are picked up/put into IPVS.
[15:45:11] klausman: can you link to the change?
[15:45:28] https://phabricator.wikimedia.org/T387854 They're all part of this
[15:45:38] So it's about a dozen individual changes
[15:45:54] We reinstalled 11 machines, and 8 of them got VLAN moves
[15:46:26] 01-08 moved, 09-11 were new enough to not need that
[15:47:01] 2008 hasn't moved yet, but at this point, we only have 01 and 08 listed in ipvsadm
[15:49:16] klausman: ok. in the middle of something right now. can I ping you when done?
[15:49:22] sure
[15:51:04] 10netops, 06Infrastructure-Foundations, 06SRE: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10674734 (10cmooney) Config to be applied in first step - P74416
[16:32:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 198.35.26.112:443 @ cp4047 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=ulsfo&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[16:32:57] klausman: we can do it in 30 mins
[16:33:22] roger. By then the last remaining machine should have its new IP
[16:33:32] cool
[16:37:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 198.35.26.112:443 @ cp4047 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=ulsfo&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[17:02:37] klausman: here
[17:07:40] Let's go :)
[17:09:57] checking what needs to happen and where
[17:10:58] klausman: ok, so ml-serve hosts in codfw
[17:10:59] starting
[17:11:05] Ack
[17:11:27] There should be 2001-2011, with 2007 possibly being broken, so that isn't a showstopper
[17:11:47] it is pooled?
[17:12:14] Luca was reimaging it early and hit a speedbump, but it may have resolved. Even if it is dead-but-pooled, we can proceed.
[17:12:23] earlier*
[17:13:22] sukhe@lvs2014:~$ curl localhost:9090/pools/k8s-ingress-ml-serve_31443
[17:13:23] ml-serve2008.codfw.wmnet: enabled/up/pooled
[17:13:23] ml-serve2003.codfw.wmnet: enabled/up/pooled
[17:13:23] ml-serve2007.codfw.wmnet: enabled/down/not pooled
[17:13:23] ml-serve2005.codfw.wmnet: enabled/up/pooled
[17:13:25] ml-serve2006.codfw.wmnet: enabled/up/pooled
[17:13:27] ml-serve2004.codfw.wmnet: enabled/up/pooled
[17:13:30] ml-serve2001.codfw.wmnet: enabled/up/pooled
[17:13:32] ml-serve2002.codfw.wmnet: enabled/up/pooled
[17:13:35] looks good
[17:13:38] at least, matches up
[17:13:49] can you confirm this list is OK?
[17:13:53] yes
[17:14:01] I can't comment on the backend server bit without checking further but I can check
[17:14:03] ok
[17:14:14] I'll figure out 2007 with Luca
[17:14:35] yeah it's in the right state as far as we are concerned, so no worries
[17:15:51] klausman: all done and looks good.
[17:15:56] ty!
[17:24:07] https://config-master.wikimedia.org/pybal/codfw/inference <- 2009-11 are missing here, did I forget a step somewhere?
[17:24:54] My bad for not spotting that earlier
[17:25:08] no worries
[17:25:09] let's look
[17:25:33] sukhe@puppetserver1001:~$ sudo confctl select 'name=ml-serve2011.codfw.wmnet' get
[17:25:36] {"ml-serve2011.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=ml_serve,service=kubesvc"}
[17:25:45] host is inactive, so pool the ones you want
[17:25:51] ack
[17:26:37] and similarly 2009 and 2010, weight are 0 and marked inactive
[17:27:21] https://config-master.wikimedia.org/pybal/codfw/inference
[17:30:05] looks good I think but you be the judge!
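[editor's note] The pool check above relies on eyeballing pybal's `/pools/<service>` output, where each backend is reported as `host: enabled/up/pooled`. A minimal sketch of automating that check, assuming only the three-field format visible in the paste (the helper name and sample hosts are illustrative, not from any real tooling):

```python
# Sketch: parse pybal pool output ("host: enabled/up/pooled" per line)
# and report backends that are not actually serving traffic.
# The field layout is assumed from the curl output pasted above.
def unpooled_hosts(pool_text: str) -> list[str]:
    bad = []
    for line in pool_text.strip().splitlines():
        host, _, state = line.partition(": ")
        enabled, up, pooled = state.split("/")
        if up != "up" or pooled != "pooled":
            bad.append(host)
    return bad

sample = """\
ml-serve2001.codfw.wmnet: enabled/up/pooled
ml-serve2007.codfw.wmnet: enabled/down/not pooled
"""
print(unpooled_hosts(sample))  # ['ml-serve2007.codfw.wmnet']
```

With output like the paste above, this would flag ml-serve2007 (down/not pooled) and nothing else, matching what the operators confirmed by hand.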
[17:35:01] ack, thanks again
[17:37:55] 06Traffic, 06Data-Engineering, 06MW-Interfaces-Team: Log Api-User-Agent header in Turnilo - https://phabricator.wikimedia.org/T373871#10675270 (10Milimetric) This should've been tagged Data-Engineering, fixing
[17:41:52] 06Traffic, 06Data-Engineering, 06MW-Interfaces-Team, 07OKR-Work: Log Api-User-Agent header in Turnilo - https://phabricator.wikimedia.org/T373871#10675291 (10Ottomata) This should fit into work for [[ https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2025-2026/Product_%26_Technology_OKRs#Res...
[17:58:44] 06Traffic: varnishkafka 1.1.0-5 exits on SIGHUP - https://phabricator.wikimedia.org/T389978#10675425 (10BCornwall)
[17:58:56] 06Traffic: varnishkafka 1.1.0-5 exits on SIGHUP - https://phabricator.wikimedia.org/T389978#10675426 (10BCornwall) 05Open→03In progress p:05Triage→03High
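[editor's note] The T389978 ticket above reports varnishkafka 1.1.0-5 exiting on SIGHUP; the actual fix is not shown in this log. The common daemon pattern for this class of bug is to install a SIGHUP handler so the signal (typically sent by logrotate or a service reload) means "reopen/reload" rather than the default action of terminating. A minimal sketch of that pattern in Python, purely for illustration (varnishkafka itself is C, and this is not its code):

```python
import os
import signal

# By default SIGHUP terminates the process. Installing a handler turns
# it into a "reload config / reopen logs" notification instead.
reload_requested = False

def on_sighup(signum, frame):
    global reload_requested
    reload_requested = True  # a real daemon would reopen logs/config here

signal.signal(signal.SIGHUP, on_sighup)

# Simulate the signal logrotate or a service reload would deliver:
os.kill(os.getpid(), signal.SIGHUP)
print(reload_requested)  # True -- the process handled SIGHUP and survived
```

The same shape applies in C via `sigaction()` with a handler that sets a `volatile sig_atomic_t` flag checked from the main loop.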