[01:00:45] netops, DC-Ops, Infrastructure-Foundations, ops-codfw: codfw: rack A8 maintenance 2026-07-01 10:00 am CT - https://phabricator.wikimedia.org/T429856 (Papaul) NEW
[02:05:50] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: rack A8 maintenance 2026-07-01 10:00 am CT - https://phabricator.wikimedia.org/T429856#12043444 (Papaul) p:Triage→Medium
[02:24:00] netops, DC-Ops, Infrastructure-Foundations, ops-codfw: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12043459 (Papaul) a:Papaul
[02:25:44] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12043469 (Papaul)
[03:56:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:01:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:26:19] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12043625 (ayounsi)
[07:50:34] netops, Infrastructure-Foundations, Data-Platform-SRE (2026-06-05 - 2026-06-26), Kubernetes: Calico IPv4/IPv6 block exhaustion on dse-k8s cluster, blocking new node provisioning - https://phabricator.wikimedia.org/T429773#12043882 (cmooney) >>! In T429773#12043077, @BTullis wrote: > So how about...
[08:36:19] netops, Data-Persistence, DC-Ops, Infrastructure-Foundations, and 3 others: codfw: rack A6 maintenance - https://phabricator.wikimedia.org/T429812#12044038 (cmooney)
[08:38:30] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: High DPI favicon for CAS - https://phabricator.wikimedia.org/T258379#12044042 (SLyngshede-WMF) Open→In progress a:SLyngshede-WMF
[08:41:29] netops, Infrastructure-Foundations, Data-Platform-SRE (2026-06-05 - 2026-06-26), Kubernetes: Calico IPv4/IPv6 block exhaustion on dse-k8s cluster, blocking new node provisioning - https://phabricator.wikimedia.org/T429773#12044054 (ayounsi) Yep! Sounds great!
[08:41:48] CAS-SSO, Infrastructure-Foundations, SRE: Document IDP MFA policy and processes - https://phabricator.wikimedia.org/T284725#12044058 (SLyngshede-WMF) Open→In progress a:SLyngshede-WMF Preliminary documentation is available here: https://wikitech.wikimedia.org/wiki/CAS-SSO#Configuring_and_...
[08:42:10] CAS-SSO, Infrastructure-Foundations, SRE: Document IDP MFA policy and processes - https://phabricator.wikimedia.org/T284725#12044062 (SLyngshede-WMF)
[08:42:14] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review: WebAuthn FIDO2 support in CAS - https://phabricator.wikimedia.org/T277841#12044063 (SLyngshede-WMF)
[08:42:58] CAS-SSO, Infrastructure-Foundations, SRE: Document IDP MFA policy and processes - https://phabricator.wikimedia.org/T284725#12044066 (SLyngshede-WMF)
[08:42:59] CAS-SSO, Infrastructure-Foundations, SRE: Enable webauthn in CAS to replace U2F - https://phabricator.wikimedia.org/T311236#12044067 (SLyngshede-WMF)
[08:43:00] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review: WebAuthn FIDO2 support in CAS - https://phabricator.wikimedia.org/T277841#12044068 (SLyngshede-WMF)
[08:43:30] CAS-SSO, Infrastructure-Foundations, SRE: Enable webauthn in CAS to replace U2F - https://phabricator.wikimedia.org/T311236#12044072 (SLyngshede-WMF)
[08:43:31] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review: WebAuthn FIDO2 support in CAS - https://phabricator.wikimedia.org/T277841#12044070 (SLyngshede-WMF) →Duplicate dup:T311236
[08:44:03] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: IDP: add policy to release given name - https://phabricator.wikimedia.org/T338214#12044075 (SLyngshede-WMF) In progress→Resolved
[08:49:40] netops, Data-Persistence, DC-Ops, Infrastructure-Foundations, and 3 others: codfw: rack A6 maintenance - https://phabricator.wikimedia.org/T429812#12044120 (cmooney)
[08:52:13] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: IDP: add policy to release given name - https://phabricator.wikimedia.org/T338214#12044134 (SLyngshede-WMF) Resolved→In progress Missing one patch. Reopening.
[08:54:22] netops, Cloud-VPS, Data-Platform-SRE, Data-Services, and 3 others: Plan to make clouddumps more resilient and easier to operate - https://phabricator.wikimedia.org/T411248#12044146 (brouberol) This will only be used by NFS clients, right? Will it impact humans going to https://dumps.wikimedia.org...
[08:55:18] CAS-SSO, Infrastructure-Foundations, SRE: Select data store for webauthn devices - https://phabricator.wikimedia.org/T380173#12044147 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF JPA backend selected. This means that we will not have to deal with synchronizing JSON files or managi...
[08:56:56] netops, Data-Persistence, DC-Ops, Infrastructure-Foundations, and 3 others: codfw: rack A6 maintenance - https://phabricator.wikimedia.org/T429812#12044155 (cmooney)
[08:57:10] CAS-SSO, Infrastructure-Foundations, SRE: Select opt-in method for webauthn - https://phabricator.wikimedia.org/T380178#12044157 (SLyngshede-WMF) Open→In progress a:SLyngshede-WMF In many cases WebAuthn will be forced. For others we'll simply utilize the built in attribute mechanism. See:...
[08:57:48] CAS-SSO, Infrastructure-Foundations, SRE: Select opt-in method for webauthn - https://phabricator.wikimedia.org/T380178#12044164 (SLyngshede-WMF) Patch available here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1304784 (We have a ton of tasks on this topic)
[09:00:07] CAS-SSO, Infrastructure-Foundations, SRE: Evaluate supported for trusted devices - https://phabricator.wikimedia.org/T380179#12044179 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF We'll be utilizing FIDO2 devices and enable TOTP once CAS reaches version 8.0.0. We have reported a bug...
[09:00:58] CAS-SSO, Infrastructure-Foundations: Upgrade Apereo CAS to version 7.3 series - https://phabricator.wikimedia.org/T419419#12044185 (SLyngshede-WMF) Open→Resolved Currently rocking 7.3.7.3
[09:01:54] CAS-SSO, Infrastructure-Foundations: CAS login page overflows on iOS Safari (iPhone 16e) - https://phabricator.wikimedia.org/T422203#12044193 (SLyngshede-WMF) In progress→Resolved Sorry, this was fixed and deploy a while back.
[11:46:05] netops, Cloud-VPS, Data-Platform-SRE, Data-Services, and 3 others: Plan to make clouddumps more resilient and easier to operate - https://phabricator.wikimedia.org/T411248#12044919 (fgiunchedi) >>! In T411248#12044146, @brouberol wrote: > This will only be used by NFS clients, right? Will it impa...
[11:51:13] \o I am trying to do the k8s IPIP thing on our staging cluster, but the test-if-IPIP-works cookbook (sre.loadbalancer.check-ipip) is failing. Knowing ~0 about IPIP, I could use some help in debugging it
[11:51:28] I am running `sudo cookbook sre.loadbalancer.check-ipip --dc codfw --query A:ml-staging-master ml-staging-ctrl`, and get at the end:
[11:51:35] RuntimeError: ml-staging-ctrl2001.codfw.wmnet is not accepting incoming IPIP traffic:
[11:51:37] outer IP header: 172.16.1.1 -> 10.192.16.93
[11:51:39] inner IP header: 10.192.32.49 -> 10.2.1.72
[11:51:41] destination port: 6443
[12:03:56] netops, Cloud-VPS, Data-Platform-SRE, Data-Services, and 3 others: Plan to make clouddumps more resilient and easier to operate - https://phabricator.wikimedia.org/T411248#12045019 (brouberol) No objection from my part!
[12:08:36] klausman: I can have a look, otherwise traffic is the best suited for that
[12:09:47] klausman: but last time I had this issue a reboot of the host was needed because the firewall was acting up
[12:10:14] I see. I will try roll-rebooting the two masters
[12:10:26] klausman: give me a minute first
[12:10:31] I'll have a quick look
[12:10:36] ack
[12:12:38] yup
[12:12:40] 2026-06-23T12:11:35.129282+00:00 ml-staging-ctrl2001 ulogd[574]: [fw-in-drop] IN=ens13 OUT= MAC=aa:00:00:bb:2b:bc:64:87:88:f2:6d:a4:08:00 SRC=172.16.1.1 DST=10.192.16.93 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=1 PROTO=4 MARK=0x0
[12:12:46] ml-staging-ctrl2001:~$ sudo tail -f /var/log/ulogd/syslog.log
[12:14:34] Alright, will roll-reboot now
[12:14:44] try that if it doesn't hurt
[12:14:53] and if still not good we can dig deeper
[12:15:11] Eh, it's staging, and even for a prod cluster, a roll-reboot of the master nodes usually doesn't cause disruption
[12:24:27] Reboot didn't help, still seeing the error in the cookbook and ulogd entries.
[12:28:42] alright, probably just an ACL missing then
[12:28:48] I think there is more than that
[12:29:35] ml-staging-ctrl2001 doesn't have the 'ipip0' or 'ipip60' virtual interfaces it would need to decap ipip packets
[12:30:51] Also, ferm's rules do not contain any 172.16/20 rules
[12:31:21] correction: iptables-save doesn't mention them, I do not understand ferm well enough to say whether it would(n't) add something
[12:31:55] did you make the puppet changes to enable ipip ?
[12:32:18] As I understood it, merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1294223 should have been enough for the masters
[12:32:48] I only know from reading the wikitech right now but that looks correct yep
[12:33:52] The ferm stuff in modules/profile/manifests/lvs/realserver/ipip.pp looks more nftables than iptables, but nft list ruleset on ml-staging-ctrl2001 shows nothing. Lotsa iptables rules though
[12:34:37] klausman: you're missing include profile::lvs::realserver::ipip in modules/role/manifests/ml_k8s/staging/master.pp
[12:35:25] ah. I figured profile::lvs::realserver::ipip::enabled was enough...
[12:35:49] I guess valentin added it everywhere but behind a feature flag and forgot that one
[12:37:19] ah no, it's the very first step in https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/IPIP#Kubernetes_API_(control_plane) :)
[12:37:23] `Include profile::lvs::realserver::ipip in the k8s master role`
[12:37:56] yeah, I misread that and figured it was referring to the ...:enabled bit
[12:38:04] yeah me too
[12:38:12] feel free to edit the doc to make it more ibvious
[12:38:16] o
[12:38:23] I also didn't cop the difference looking at your patch right after reading that page
[12:38:49] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1305110 if anyone wants to +1 ;)
[12:40:13] done
[12:41:37] merci!
[12:47:55] and now it works! Thanks, everyone
[12:53:29] nice!
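Editor's note on the IPIP thread above: the ulogd entry quoted at 12:12:40 already shows what was going wrong, since `PROTO=4` is the IP protocol number for IPv4-in-IPv4 encapsulation and the `[fw-in-drop]` tag marks the packet as dropped on input. A minimal sketch of picking that out of such a line follows; the helper names `parse_fields` and `is_dropped_ipip` are hypothetical and not part of any Wikimedia tooling, and the only input is the log line as pasted in the channel.

```python
import re

# Log line as quoted in the channel at 12:12:40 (ulogd firewall drop entry).
LOG_LINE = (
    "2026-06-23T12:11:35.129282+00:00 ml-staging-ctrl2001 ulogd[574]: "
    "[fw-in-drop] IN=ens13 OUT= MAC=aa:00:00:bb:2b:bc:64:87:88:f2:6d:a4:08:00 "
    "SRC=172.16.1.1 DST=10.192.16.93 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=1 "
    "PROTO=4 MARK=0x0"
)

# IP protocol number 4 is IPv4-in-IPv4 encapsulation (IPIP).
IPIP_PROTO = "4"


def parse_fields(line):
    """Return the KEY=value pairs of a netfilter-style log line as a dict.

    Hypothetical helper: keys are runs of capital letters, values run up to
    the next whitespace and may be empty (as in "OUT=").
    """
    return dict(re.findall(r"\b([A-Z]+)=(\S*)", line))


def is_dropped_ipip(line):
    """True if the line is an input-drop entry for an IPIP packet."""
    fields = parse_fields(line)
    return "[fw-in-drop]" in line and fields.get("PROTO") == IPIP_PROTO


if __name__ == "__main__":
    fields = parse_fields(LOG_LINE)
    print(fields["SRC"], "->", fields["DST"], "proto", fields["PROTO"])
    print("dropped IPIP packet:", is_dropped_ipip(LOG_LINE))
```

The source and destination parsed here match the outer IP header reported by the cookbook (172.16.1.1 -> 10.192.16.93), which is consistent with the host firewall rejecting the encapsulated packets until the missing profile was included in the master role.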
[13:05:26] netops, Infrastructure-Foundations, Data-Platform-SRE (2026-06-05 - 2026-06-26), Kubernetes: Calico IPv4/IPv6 block exhaustion on dse-k8s cluster, blocking new node provisioning - https://phabricator.wikimedia.org/T429773#12045319 (cmooney) I have assigned [[ https://netbox.wikimedia.org/ipam/pre...
[13:13:43] netops, Cloud-VPS, Data-Platform-SRE, Data-Services, and 3 others: Plan to make clouddumps more resilient and easier to operate - https://phabricator.wikimedia.org/T411248#12045381 (Gehel) Moving this to "watching" as I don't expect #data-platform-sre to be involved in the implementation. Ping us...
[13:18:04] Puppet, Infrastructure-Foundations, Puppet-Core, SRE, Technical-Debt: Uniform cluster nomenclature across puppet - https://phabricator.wikimedia.org/T159411#12045423 (Aklapper) @Joe: Hi, 9y later, is this still wanted? TIA
[14:50:07] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN:Switch refresh diagram and wiring - https://phabricator.wikimedia.org/T423724#12046011 (Papaul)
[14:57:02] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN:Switch refresh diagram and wiring - https://phabricator.wikimedia.org/T423724#12046036 (Papaul)
[15:16:21] Puppet, Release-Engineering-Team: registry-homepage-builder.py doesn't sort images as expected - https://phabricator.wikimedia.org/T388287#12046149 (hashar) Thanks @elukey for the review and merge. Looking at https://docker-registry.wikimedia.org/releng/node22-test-browser/tags/ , the `22.6.0` is still...