[04:06:15] Traffic, Data-Platform-SRE, Data-Engineering (Q4 FS25/26 April 1st - June 30st): Provide a scheduled data download service from Google Cloud Storage - https://phabricator.wikimedia.org/T427457#12027563 (Ahoelzl) @Antoine_Quhen newly has access to Google Cloud and can configure granular data access go...
[08:28:14] Traffic, Lift-Wing, ServiceOps new, ServiceOps-SharedInfra, Machine-Learning-Team (Q4 FY2025-26): Host Qwen 3.6-27B as an inference service - https://phabricator.wikimedia.org/T425680#12028075 (gkyziridis) ==== Qwen3.6-27B-FP8 Deployment Plan ===== Since the vllm image is upgraded and deploy...
[08:49:24] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 4 others: Increase visibility of kubernetes network status - https://phabricator.wikimedia.org/T356877#12028173 (JMeybohm)
[09:17:14] netops, Cloud-VPS, Infrastructure-Foundations, tools-infrastructure-team, cloud-services-team (FY2025/2026-Q3-Q4): Establish a blackbox network probe vantage point into cloud realm - https://phabricator.wikimedia.org/T429451#12028306 (fgiunchedi) cc #netops too for their input on the idea
[11:47:47] netops, Infrastructure-Foundations, SRE: Create a cookbook to add tagged_vlans to cloudsw ports - https://phabricator.wikimedia.org/T429466 (cmooney) NEW p:Triage→Low
[13:05:44] brett: I have rolled out the uncommitted changes on the CR routers in eqsin to peer with the new dns5004 IPs
[13:05:51] the old ones are unreachable so it seemed safe
[13:07:33] sessions have now established ok
[13:07:40] https://www.irccloud.com/pastebin/BhXbXDB9/
[13:10:11] topranks: thank you for doing that!
[13:10:58] np, well I was rolling out something else and it was in the diff
[13:11:00] things look ok
[13:11:02] https://grafana-rw.wikimedia.org/d/Jj8MztfZz/authoritative-dns?orgId=1&from=now-1h&to=now&timezone=utc&var-datasource=000000026&var-server=dns5004&refresh=30s
[13:12:17] we should keep this in mind for the lvs work as well (today)
[13:12:51] (we -> Traffic)
[13:13:44] yep. you can set the bgp flag in Netbox to "false" also, which will mean homer won't try to configure the peering on the CRs
[13:46:02] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#12029369 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0f25bf0f-a791-457b-ad82-68cc6bf09194) set by pt1979@cumin2002 for 1:00:0...
[13:47:01] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#12029377 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=113ab0a6-249c-4a59-a4d0-49f9f85ef5d6) set by pt1979@cumin2002 for 1:00:0...
[13:55:55] netops, Infrastructure-Foundations, SRE: Create cookbook to add BGP peering for host by triggering Homer run on correct device - https://phabricator.wikimedia.org/T429488 (cmooney) NEW p:Triage→Low
[14:16:07] netops, Infrastructure-Foundations, SRE: Create cookbook to add BGP peering for host by triggering Homer run on correct device - https://phabricator.wikimedia.org/T429488#12029587 (cmooney)
[14:16:17] netops, Infrastructure-Foundations, SRE: Create cookbook to add BGP peering for host by triggering Homer run on correct device - https://phabricator.wikimedia.org/T429488#12029589 (cmooney)
[14:16:50] netops, Infrastructure-Foundations, SRE: Create cookbook to add BGP peering for host by triggering Homer run on correct device - https://phabricator.wikimedia.org/T429488#12029594 (BTullis) Thanks very much. I can see this as being extremely useful for us in #data-platform-sre
[14:16:53] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#12029593 (Papaul) I Changed back the configuration on mr1-codfw for the irb-900 interface since a reboot is needed. I will schedule a maintena...
[14:18:33] netops, Infrastructure-Foundations, SRE: Create cookbook to add BGP peering for host by triggering Homer run on correct device - https://phabricator.wikimedia.org/T429488#12029612 (ayounsi) Yep it's a good idea, but I think we will soon be there ! For decom: https://gerrit.wikimedia.org/r/c/operatio...
[14:28:17] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#12029648 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin1003 for host dns7002.wikimedia.org with OS trixie
[14:38:53] Traffic, Maps, SRE: Possibility to allow Wikimedia Maps usage on all Wikibase Cloud instances - https://phabricator.wikimedia.org/T429191#12029719 (ssingh) @MSantos: this needs your approval.
[15:15:14] netops, Infrastructure-Foundations, SRE: SR-Linux: applying analytics-in acl to irb sub-interface blocks ARP - https://phabricator.wikimedia.org/T429499 (cmooney) NEW p:Triage→High
[15:17:16] Thanks, topranks! My b on not checking the graphs
[15:49:51] Traffic, Infrastructure-Foundations, SRE: Scaling urldownloaders by adding redundancy and load balancing - https://phabricator.wikimedia.org/T429175#12030100 (CDanis) There's a hidden Option 4 here, which is to declare that urldownloader would be the first Sophroid-only service, only accessible via t...
[16:11:55] FIRING: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs5005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:01:55] FIRING: [2x] SystemdUnitFailed: anycast-healthchecker.service on dns7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:02:25] FIRING: SystemdUnitCrashLoop: pdns-recursor.service crashloop on dns7002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:03:48] FIRING: PuppetFailure: Puppet has failed on dns7002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:04:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on dns7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[17:04:14] ^ depooled, being reimaged
[17:17:23] ah ok thanks
[17:17:38] weird bird error complaining about a syntax error
[17:18:11] ah I think because of the included file, just a quirk of how it is set up
[17:18:22] netops, Infrastructure-Foundations: codfw: upgrade routers (2026) - https://phabricator.wikimedia.org/T417871#12030675 (Papaul) @BCornwall no just wanted to give you a heads up.
[17:18:23] syntax is ok but refers to named var that is in the other file
[17:28:46] netops, Cloud-VPS, Infrastructure-Foundations, tools-infrastructure-team, cloud-services-team (FY2025/2026-Q3-Q4): Establish a blackbox network probe vantage point into cloud realm - https://phabricator.wikimedia.org/T429451#12030754 (cmooney) @fgiunchedi it seems like a reasonable idea yeah....
[17:34:07] netops, homer, Infrastructure-Foundations, SRE: Homer should abort on filter rules applied on non-existent or disabled interfaces - https://phabricator.wikimedia.org/T428886#12030797 (cmooney) Open→Resolved a:cmooney I'm going to close this one now. The patch to configure interfa...
[18:16:55] FIRING: [4x] SystemdUnitFailed: haproxy.service on cp6011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:18:50] sukhe: random thought, it's probably not too hard to configure that alert to get suppressed on hosts with low uptime, which i think is the worst class of false alarm by a lot
[18:22:04] ^ignore the haproxy error, that was from a reboot
[18:22:18] and yes, I agree
[18:23:11] But I'd rather fix it by not having haproxy's service fail repeatedly but have it wait until the tls materials exist
[18:23:27] Something like ConditionPathExists
[18:26:49] yeah good idea.
[18:29:53] TIL
[18:36:03] Probably the route would be to have e.g. haproxy-tls.target and have whatever unit populates that tmpfs volume activate it
[18:36:20] then have After=haproxy-tls.target in haproxy.service
[18:37:29] and then as a check, add ConditionDirectoryNotEmpty or something so haproxy doesn't try to start without tls materials
[18:37:49] (in meeting, will follow later)
[18:37:52] *then* we could make a separate alert that the tls keys haven't been populated correctly
[19:10:52] Traffic: Remove Digicert CAA records from most domains - https://phabricator.wikimedia.org/T428093#12031203 (Jgreen) a:Jgreen→BCornwall >>! In T428093#11982872, @ssingh wrote: >>>! In T428093#11982863, @taavi wrote: >>>>! In T428093#11982860, @ssingh wrote: >>> So `payments` is a CNAME and hence we c...
[19:31:39] Traffic, Infrastructure-Foundations: eqsin: re-image rack 604 servers on new vlan - https://phabricator.wikimedia.org/T428229#12031283 (BCornwall) Open→Resolved
[20:16:55] FIRING: [2x] SystemdUnitFailed: anycast-healthchecker.service on dns7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:21:55] FIRING: [2x] SystemdUnitFailed: anycast-healthchecker.service on dns7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:27:25] RESOLVED: SystemdUnitCrashLoop: pdns-recursor.service crashloop on dns7002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:41:22] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#12031546 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin1003 for host dns7002.wikimedia.org with OS trixie executed with errors: - dns7002 (**FAIL**) - Downtimed on Icinga/Alertmanager...
[21:02:56] netops, Infrastructure-Foundations, SRE: cr2-esams rpd failure after enabling bgp 'graceful-shutdown' (June 2026) - https://phabricator.wikimedia.org/T429386#12031657 (cmooney) Juniper have come back to say this is known bug, somewhat expected I guess. ` After decoding the coredump, it was confirmed...
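
Editor's note on the 18:18 to 18:37 thread (haproxy failing after a reboot before its TLS materials exist): the two ideas floated there can be sketched roughly as below. This is only an illustration, not the Traffic team's implementation. The first function approximates in Python what systemd's ConditionDirectoryNotEmpty= is documented to check (the path exists and is a non-empty directory); the second models the "suppress on low uptime" suggestion. The function names, the 900-second grace period and the example file name are all assumptions made for the sketch; nothing in the log specifies them.

```python
import tempfile
from pathlib import Path


def directory_not_empty(path: str) -> bool:
    """Rough stand-in for systemd's ConditionDirectoryNotEmpty= check:
    true only if the path is a directory containing at least one entry."""
    p = Path(path)
    return p.is_dir() and any(p.iterdir())


def suppress_for_low_uptime(uptime_seconds: float, grace_seconds: float = 900.0) -> bool:
    """Hypothetical suppression rule: ignore a unit-failed alert while the
    host has been up for less than grace_seconds (assumed value)."""
    return uptime_seconds < grace_seconds


if __name__ == "__main__":
    # Stand-in for the tmpfs volume that holds the TLS materials.
    with tempfile.TemporaryDirectory() as tls_dir:
        print("before keys are written:", directory_not_empty(tls_dir))
        (Path(tls_dir) / "example.pem").write_text("placeholder")
        print("after keys are written:", directory_not_empty(tls_dir))
    print("suppress alert 2 minutes after boot:", suppress_for_low_uptime(120.0))
```

In the approach discussed in the channel the check would live in the unit file itself (After=haproxy-tls.target plus a Condition… directive) rather than in a script; the Python above only shows the logic each piece is meant to express.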