[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T0000) [00:02:04] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:02:04] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:05:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T410589)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20251202-000512-ladsgroup.json [00:05:20] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [00:05:33] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [00:05:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2169 (T410589)', diff saved to https://phabricator.wikimedia.org/P86271 and previous config saved to /var/cache/conftool/dbconfig/20251202-000540-ladsgroup.json [00:09:04] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:11:56] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:11:56] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:31:45] FIRING: Traffic bill over quota: Alert for device cr3-ulsfo.wikimedia.org - Traffic bill over quota Has worsened - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:31:56] (CR) Reedy: OATHAuth: Expand 2FA to all users (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1213585 (owner: Mstyles) [00:36:45] FIRING: [3x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:39:57] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1213597 [00:39:57] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1213597 (owner: TrainBranchBot) [00:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [00:42:03] (CR) Dzahn: "I created https://releases.wikimedia.org/mwcli/newer_releases_here.html" [puppet] - https://gerrit.wikimedia.org/r/1213587 (owner: Dzahn) [00:51:06] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1213597 (owner: TrainBranchBot) [00:51:45] FIRING: [3x] Traffic bill over quota: 
Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:53:12] PROBLEM - dump of s6 in codfw on backupmon1001 is CRITICAL: Last dump for s6 at codfw (db2197) taken on 2025-12-02 00:00:08 is 61 GiB, but the previous one was 72 GiB, a change of -15.6 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:56:45] RESOLVED: [2x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [01:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1213131 (owner: TrainBranchBot) [01:08:47] (PS2) Mstyles: OATHAuth: Expand 2FA to all users [mediawiki-config] - https://gerrit.wikimedia.org/r/1213585 [01:09:06] PROBLEM - dump of s6 in eqiad on backupmon1001 is CRITICAL: Last dump for s6 at eqiad (db1225) taken on 2025-12-02 00:00:07 is 61 GiB, but the previous one was 72 GiB, a change of -15.6 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:09:06] PROBLEM - dump of s5 in eqiad on backupmon1001 is CRITICAL: Last dump for s5 at eqiad (db1216) taken on 2025-12-02 00:00:07 is 51 GiB, but the previous one was 60 GiB, a change of -15.6 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:10:04] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1213599 [01:10:04] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1213599 (owner: TrainBranchBot) [01:17:12] (PS3) Mstyles: OATHAuth: Expand 2FA to all users [mediawiki-config] - https://gerrit.wikimedia.org/r/1213585 (https://phabricator.wikimedia.org/T399664) [01:17:24] (PS4) Mstyles: OATHAuth: Expand 2FA to all users [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1213585 (https://phabricator.wikimedia.org/T399664) [01:18:09] (CR) Mstyles: OATHAuth: Expand 2FA to all users (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1213585 (https://phabricator.wikimedia.org/T399664) (owner: Mstyles) [01:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:33:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:33:57] (PS1) RLazarus: aux_k8s: Write Envoy hieradata to YAML files for sophroid [puppet] - https://gerrit.wikimedia.org/r/1213604 [01:34:19] (CR) RLazarus: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1213604 (owner: RLazarus) [01:36:25] (PS2) RLazarus: aux_k8s: Write Envoy hieradata to YAML files for sophroid [puppet] - https://gerrit.wikimedia.org/r/1213604 [01:36:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2211.codfw.wmnet with reason: Maintenance [01:36:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2211 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86272 and previous config saved to /var/cache/conftool/dbconfig/20251202-013635-marostegui.json [01:36:40] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [01:36:41] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [01:37:02] (Merged) 
jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1213599 (owner: TrainBranchBot) [01:37:41] (PS1) Scott French: hieradata: enable cfssl/pki for etcd on conf1008 [puppet] - https://gerrit.wikimedia.org/r/1213600 (https://phabricator.wikimedia.org/T352245) [01:37:42] (PS1) Scott French: hieradata: temporarily point eqiad LVS at conf1008 [puppet] - https://gerrit.wikimedia.org/r/1213601 (https://phabricator.wikimedia.org/T352245) [01:37:46] (PS1) Scott French: hieradata: enable cfssl/pki for etcd on all configcluster hosts [puppet] - https://gerrit.wikimedia.org/r/1213602 (https://phabricator.wikimedia.org/T352245) [01:37:48] (PS1) Scott French: hieradata: point eqiad LVS back to conf1007 [puppet] - https://gerrit.wikimedia.org/r/1213603 (https://phabricator.wikimedia.org/T352245) [01:37:56] (CR) RLazarus: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1213604 (owner: RLazarus) [01:48:17] (PS3) RLazarus: aux_k8s: Write Envoy hieradata to YAML files for sophroid [puppet] - https://gerrit.wikimedia.org/r/1213604 [01:48:28] (CR) RLazarus: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1213604 (owner: RLazarus) [02:09:52] (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.5 [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1213607 (https://phabricator.wikimedia.org/T408275) [02:09:54] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.5 [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1213607 (https://phabricator.wikimedia.org/T408275) (owner: TrainBranchBot) [02:10:21] (CR) RLazarus: aux_k8s: Write Envoy hieradata to YAML files for sophroid (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1213604 (owner: RLazarus) [02:16:09] (CR) Eevans: [C:-2] "Your keyspace is 'analytics', and the tables are 'pageviews_per_editor' and 'pageviews_top_pages_per_editor' (and they already have 
MODIFY" [puppet] - https://gerrit.wikimedia.org/r/1213571 (https://phabricator.wikimedia.org/T410962) (owner: Aleksandar Mastilovic) [02:21:58] SRE, vrts, Znuny: disk full at VRTS host? - https://phabricator.wikimedia.org/T411452#11422302 (Krd) [02:23:27] ^ VRTS is down [02:23:35] (Merged) jenkins-bot: Branch commit for wmf/1.46.0-wmf.5 [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1213607 (https://phabricator.wikimedia.org/T408275) (owner: TrainBranchBot) [02:26:56] SRE, collaboration-services, vrts, Znuny: disk full at VRTS host? - https://phabricator.wikimedia.org/T411452#11422304 (Krd) [02:34:49] (PS1) C. Scott Ananian: ParserOutputAccess: use consistent metrics labels [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1213609 [02:35:31] (CR) C. Scott Ananian: [C:+2] "Just missed the train branch; fixes CI for VisualEditor." [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1213609 (owner: C. Scott Ananian) [02:36:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:43:26] SRE, collaboration-services, vrts, Znuny, Wikimedia-Incident: disk full at VRTS host? - https://phabricator.wikimedia.org/T411452#11422309 (AntiCompositeNumber) [02:46:08] SRE, collaboration-services, vrts, Znuny, Wikimedia-Incident: disk full at VRTS host? - https://phabricator.wikimedia.org/T411452#11422311 (AntiCompositeNumber) https://grafana.wikimedia.org/d/000000371/vrts?orgId=1&from=2025-11-02T02:44:01.700Z&to=2025-12-02T02:43:01.700Z&timezone=utc&var-no... [02:48:24] (Merged) jenkins-bot: ParserOutputAccess: use consistent metrics labels [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1213609 (owner: C. 
Scott Ananian) [02:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T0300) [03:04:32] SRE, collaboration-services, vrts, Znuny, Wikimedia-Incident: disk full at VRTS host? - https://phabricator.wikimedia.org/T411452#11422315 (ssingh) @Dzahn has freed up some inodes. We were not out of disk space, we were out of inodes. We are trying to free up some more but for now, we should... [03:06:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86273 and previous config saved to /var/cache/conftool/dbconfig/20251202-030615-marostegui.json [03:06:20] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [03:06:21] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [03:14:28] !log vrts1003 - sudo -u otrs ./bin/otrs.Console.pl Maint::Cache::Delete (T411452) [03:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:14:31] T411452: disk full at VRTS host? - https://phabricator.wikimedia.org/T411452 [03:15:12] !log vrts1003 - compressed /opt/znuny-6.5.16 and .17 to .tar.gz files - then deleted uncompressed versions - freeing about 700k inodes (T411452) [03:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:17:44] SRE, collaboration-services, vrts, Znuny, Wikimedia-Incident: disk full at VRTS host? 
- https://phabricator.wikimedia.org/T411452#11422337 (Dzahn) in this context I found: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009303/3/modules/vrts/manifests/init.pp which disabled a timer th... [03:21:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P86274 and previous config saved to /var/cache/conftool/dbconfig/20251202-032122-marostegui.json [03:25:35] SRE, collaboration-services, vrts, Znuny, Wikimedia-Incident: disk full at VRTS host? - https://phabricator.wikimedia.org/T411452#11422339 (Dzahn) p:Unbreak!→High inode usage on / is back to 2% - exim logs show emails are going out therefore lowering from UBN to High [03:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:30:11] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [03:34:09] SRE, collaboration-services, vrts, Znuny, Wikimedia-Incident: disk full at VRTS host? 
- https://phabricator.wikimedia.org/T411452#11422362 (Dzahn) [03:36:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P86275 and previous config saved to /var/cache/conftool/dbconfig/20251202-033630-marostegui.json [03:38:51] FIRING: [2x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [03:39:35] SRE, collaboration-services, vrts, Znuny, Wikimedia-Incident: disk full at VRTS host? - https://phabricator.wikimedia.org/T411452#11422377 (Krd) There are now some tickets where responses have not been sent out. How do we find these tickets? 
[03:41:45] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr1-magru.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [03:41:52] !incidents [03:41:53] 7072 (ACKED) [2x] TransitPeeringTransportOutboundSaturation network sre (gnmi) [03:41:53] 7073 (UNACKED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [03:41:53] 7071 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [03:41:54] 7068 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr2-eqord:9804 Peering: Equinix (Wikimedia-CH2-IX-01 Chicago, MAC Filter, SR17915277) {#11374} xe-0/1/4 gnmi eqiad) [03:41:54] 7070 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr2-eqord.wikimedia.org) [03:41:54] 7069 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqord.wikimedia.org) [03:41:57] !ack 7073 [03:41:57] 7073 (ACKED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [03:42:46] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:42:58] !ack 7074 [03:42:59] 7074 (ACKED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [03:43:15] !ack 7073 [03:43:15] 7073 (ACKED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [03:44:30] PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [03:47:20] RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.063 second response time 
https://wikitech.wikimedia.org/wiki/Thanos [03:48:51] FIRING: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-1/0/1:1 (Core: cr2-eqiad:xe-1/0/1:1 {#180180823000240:1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSatu [03:49:04] !incidents [03:49:04] 7072 (ACKED) [2x] TransitPeeringTransportOutboundSaturation network sre (gnmi) [03:49:05] 7073 (ACKED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [03:49:05] 7074 (ACKED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [03:49:05] 7075 (UNACKED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: cr2-eqiad:xe-1/0/1:1 {#180180823000240:1} xe-1/0/1:1 gnmi eqiad) [03:49:05] 7071 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [03:49:05] 7068 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr2-eqord:9804 Peering: Equinix (Wikimedia-CH2-IX-01 Chicago, MAC Filter, SR17915277) {#11374} xe-0/1/4 gnmi eqiad) [03:49:06] 7070 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr2-eqord.wikimedia.org) [03:49:06] 7069 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqord.wikimedia.org) [03:49:10] !ack 7075 [03:49:11] 7075 (ACKED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: cr2-eqiad:xe-1/0/1:1 {#180180823000240:1} xe-1/0/1:1 gnmi eqiad) [03:51:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86276 and previous config saved to /var/cache/conftool/dbconfig/20251202-035138-marostegui.json [03:51:42] T411163: Drop ar_sha1 from 
archive table in wmf production - https://phabricator.wikimedia.org/T411163 [03:51:43] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [03:51:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance [03:52:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1175 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86277 and previous config saved to /var/cache/conftool/dbconfig/20251202-035202-marostegui.json [03:52:46] FIRING: [2x] Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:53:51] RESOLVED: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-1/0/1:1 (Core: cr2-eqiad:xe-1/0/1:1 {#180180823000240:1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSa [03:54:05] !incidents [03:54:05] 7072 (ACKED) [2x] TransitPeeringTransportOutboundSaturation network sre (gnmi) [03:54:05] 7073 (ACKED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [03:54:06] 7074 (ACKED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [03:54:06] 7075 (RESOLVED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: cr2-eqiad:xe-1/0/1:1 {#180180823000240:1} xe-1/0/1:1 gnmi eqiad) [03:54:06] 7071 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [03:54:06] 7068 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre 
(cr2-eqord:9804 Peering: Equinix (Wikimedia-CH2-IX-01 Chicago, MAC Filter, SR17915277) {#11374} xe-0/1/4 gnmi eqiad) [03:54:06] 7070 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr2-eqord.wikimedia.org) [03:54:07] 7069 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqord.wikimedia.org) [03:56:45] FIRING: [2x] Primary inbound port utilisation over 80% #page: Alert for device cr1-magru.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [03:57:46] FIRING: [2x] Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:58:04] device recovered? [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T0400) [04:01:45] FIRING: [2x] Primary inbound port utilisation over 80% #page: Alert for device cr1-magru.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [04:02:27] (PS1) TrainBranchBot: testwikis to 1.46.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1213615 (https://phabricator.wikimedia.org/T408275) [04:02:29] (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1213615 (https://phabricator.wikimedia.org/T408275) (owner: TrainBranchBot) [04:03:18] (Merged) jenkins-bot: testwikis to 1.46.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1213615 (https://phabricator.wikimedia.org/T408275) (owner: TrainBranchBot) [04:03:49] !log 
mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.5 refs T408275 [04:03:52] T408275: 1.46.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T408275 [04:16:45] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr1-magru.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [04:17:46] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [04:18:51] FIRING: [2x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [04:18:59] !incidents [04:18:59] 7072 (ACKED) [2x] TransitPeeringTransportOutboundSaturation network sre (gnmi) [04:18:59] 7074 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [04:19:00] 7073 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [04:19:00] 7075 (RESOLVED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: cr2-eqiad:xe-1/0/1:1 {#180180823000240:1} xe-1/0/1:1 gnmi eqiad) [04:19:00] 7071 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [04:19:00] 7068 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr2-eqord:9804 Peering: Equinix (Wikimedia-CH2-IX-01 Chicago, MAC Filter, SR17915277) {#11374} xe-0/1/4 gnmi eqiad) [04:19:00] 7070 (RESOLVED) Primary inbound port 
utilisation over 80% (paged) network noc (cr2-eqord.wikimedia.org) [04:19:01] 7069 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqord.wikimedia.org) [04:34:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86278 and previous config saved to /var/cache/conftool/dbconfig/20251202-043424-marostegui.json [04:34:30] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [04:34:30] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [04:38:51] RESOLVED: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [04:38:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [04:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [04:48:35] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.5 refs T408275 (duration: 44m 45s) [04:48:37] T408275: 1.46.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T408275 [04:49:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P86279 and previous config saved to 
/var/cache/conftool/dbconfig/20251202-044931-marostegui.json [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T0500) [05:02:58] !log mwpresync@deploy2002 Pruned MediaWiki: 1.46.0-wmf.2 (duration: 02m 56s) [05:03:48] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1213629 [05:04:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P86280 and previous config saved to /var/cache/conftool/dbconfig/20251202-050439-marostegui.json [05:10:00] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:14:48] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit2003), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:19:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86281 and previous config saved to /var/cache/conftool/dbconfig/20251202-051947-marostegui.json [05:19:52] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [05:19:52] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [05:20:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2213.codfw.wmnet with reason: Maintenance [05:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:20:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2213 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86282 and previous config saved to /var/cache/conftool/dbconfig/20251202-052010-marostegui.json [05:20:16] SRE, collaboration-services, vrts, Znuny, Wikimedia-Incident: disk full at VRTS host? - https://phabricator.wikimedia.org/T411452#11422562 (Krd) Disregard my last entry. Tickets have been identified. [05:34:14] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11422574 (Papaul) @ayounsi @cmooney please see below the steps to replace the loopback IPs on cr3/4-ulsfo and mr1-ulsfo If all this looks good, I will setup... [05:35:00] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:42:46] (PS2) KartikMistry: Update cxserver to 2025-12-02-041957-production [deployment-charts] - https://gerrit.wikimedia.org/r/1213141 [05:45:30] Deploying cxserver: config/build updates.. 
[05:46:05] (CR) KartikMistry: [C:+2] Update cxserver to 2025-12-02-041957-production [deployment-charts] - https://gerrit.wikimedia.org/r/1213141 (owner: KartikMistry) [05:47:50] (Merged) jenkins-bot: Update cxserver to 2025-12-02-041957-production [deployment-charts] - https://gerrit.wikimedia.org/r/1213141 (owner: KartikMistry) [05:49:36] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:50:01] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:52:00] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [05:52:31] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [05:57:47] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [05:59:05] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [05:59:29] !log Updated cxserver to 2025-12-02-041957-production + Yandex key removal from production config [05:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:48] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:32:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2228.codfw.wmnet with reason: Schema change [06:36:33] (PS1) Marostegui: site.pp: Change note [puppet] - https://gerrit.wikimedia.org/r/1213899 [06:36:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:37:10] (CR) Marostegui: [C:+2] site.pp: Change note [puppet] - https://gerrit.wikimedia.org/r/1213899 (owner: Marostegui) [06:50:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance 
db2169 (T410589)', diff saved to https://phabricator.wikimedia.org/P86283 and previous config saved to /var/cache/conftool/dbconfig/20251202-065007-ladsgroup.json [06:50:11] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [06:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T0700) [07:00:05] marostegui, Amir1, and federico3: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T0700). [07:05:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P86284 and previous config saved to /var/cache/conftool/dbconfig/20251202-070514-ladsgroup.json [07:15:19] (03CR) 10Ryan Kemper: [C:03+1] query_service: alert on high number of JVM thread (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1212170 (https://phabricator.wikimedia.org/T389859) (owner: 10Gehel) [07:15:31] (03CR) 10Ryan Kemper: [C:03+2] query_service: alert on high number of JVM thread [alerts] - 10https://gerrit.wikimedia.org/r/1212170 (https://phabricator.wikimedia.org/T389859) (owner: 10Gehel) [07:17:09] (03Merged) 10jenkins-bot: query_service: alert on high number of JVM thread [alerts] - 10https://gerrit.wikimedia.org/r/1212170 (https://phabricator.wikimedia.org/T389859) (owner: 10Gehel) [07:20:13] (03CR) 10Slyngshede: [C:03+2] Form labels: Fix for labels for Codex styled forms [software/bitu] - 10https://gerrit.wikimedia.org/r/1213452 (https://phabricator.wikimedia.org/T410492) (owner: 10Slyngshede) [07:20:23] !log 
ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P86285 and previous config saved to /var/cache/conftool/dbconfig/20251202-072022-ladsgroup.json [07:21:03] (03CR) 10Ryan Kemper: wdqs: add availability sli recording rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [07:22:55] (03Merged) 10jenkins-bot: Form labels: Fix for labels for Codex styled forms [software/bitu] - 10https://gerrit.wikimedia.org/r/1213452 (https://phabricator.wikimedia.org/T410492) (owner: 10Slyngshede) [07:22:59] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, much nicer" [puppet] - 10https://gerrit.wikimedia.org/r/1213436 (https://phabricator.wikimedia.org/T411081) (owner: 10Majavah) [07:23:43] 10SRE-SLO, 10observability, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work, 13Patch-For-Review: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11422622 (10RKemper) The current iteration of https://gerrit.wikimedia.org/r/c/operations... 
[07:23:49] (03CR) 10Slyngshede: [C:03+1] Deprecate restbase-roots/restbase-admins [puppet] - 10https://gerrit.wikimedia.org/r/1213528 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [07:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:30:11] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [07:30:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:35:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:35:22] 06SRE, 06Infrastructure-Foundations: Broadcom Nic not supporting uefi with older firmware - https://phabricator.wikimedia.org/T411374#11422629 (10Marostegui) [07:35:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T410589)', diff saved to https://phabricator.wikimedia.org/P86286 and previous config saved to /var/cache/conftool/dbconfig/20251202-073530-ladsgroup.json [07:35:33] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [07:35:46] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [07:35:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2180 (T410589)', diff saved to https://phabricator.wikimedia.org/P86287 and previous config saved to /var/cache/conftool/dbconfig/20251202-073553-ladsgroup.json [07:42:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212150 (https://phabricator.wikimedia.org/T408737) (owner: 10DCausse) [07:45:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:45:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:53:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2223.codfw.wmnet with reason: Schema change [07:59:30] 06SRE, 06collaboration-services, 10vrts, 10Znuny, 07Wikimedia-Incident: No space left on device on VRTS host - https://phabricator.wikimedia.org/T411452#11422638 (10Aklapper) [08:00:04] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T0800). [08:00:04] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:52] o/ [08:01:00] I can deploy [08:02:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212150 (https://phabricator.wikimedia.org/T408737) (owner: 10DCausse) [08:03:14] (03Merged) 10jenkins-bot: cirrus: enable georgian transliteration second try profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212150 (https://phabricator.wikimedia.org/T408737) (owner: 10DCausse) [08:04:22] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1212150|cirrus: enable georgian transliteration second try profile (T408737)]] [08:04:25] T408737: Enable Georgian Transliteration Second Try mappings for autocomplete - https://phabricator.wikimedia.org/T408737 [08:06:34] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1212150|cirrus: enable georgian transliteration second try profile (T408737)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:06:45] testing [08:09:05] !log dcausse@deploy2002 dcausse: Continuing with sync [08:11:02] (03PS1) 10Kosta Harlan: Allow similar signals to be merged into an existing case [extensions/CheckUser] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213949 (https://phabricator.wikimedia.org/T410303) [08:11:10] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: extend README with dummy network interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1213426 (owner: 10Filippo Giunchedi) [08:12:24] PROBLEM - LDAP -writable server- on serpens is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [08:14:22] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1212150|cirrus: enable georgian transliteration second try profile (T408737)]] (duration: 10m 00s) [08:14:25] T408737: Enable Georgian Transliteration Second Try mappings for autocomplete - https://phabricator.wikimedia.org/T408737 [08:16:07] (03CR) 10Muehlenhoff: [C:03+2] Deprecate restbase-roots/restbase-admins [puppet] - 10https://gerrit.wikimedia.org/r/1213528 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [08:17:18] !log closing the utc morning backport window [08:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86288 and previous config saved to /var/cache/conftool/dbconfig/20251202-081758-marostegui.json [08:18:03] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [08:18:03] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [08:18:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:20:44] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11422669 (10MoritzMuehlenhoff) [08:21:17] FIRING: ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:22:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:23:15] (03PS2) 10Daniel Kinzler: rest gateway: do not rate limit internal traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213504 (https://phabricator.wikimedia.org/T410143) [08:23:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86289 and previous config saved to /var/cache/conftool/dbconfig/20251202-082345-marostegui.json [08:23:51] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [08:23:51] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [08:26:17] FIRING: [16x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:27:02] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:31:17] FIRING: [19x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:31:32] FIRING: [19x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:32:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:33:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P86290 and previous config saved to /var/cache/conftool/dbconfig/20251202-083306-marostegui.json [08:35:24] RECOVERY - LDAP -writable server- on serpens is OK: LDAP OK - 0.102 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [08:36:17] FIRING: [17x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:37:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:38:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P86291 and previous config saved to /var/cache/conftool/dbconfig/20251202-083853-marostegui.json [08:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:40:55] !log restarting wdqs@codfw - system overloaded [08:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:17] FIRING: [19x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:42:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:46:17] RESOLVED: [18x] ProbeDown: Service wdqs2007:443 has failed probes 
(http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:47:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:47:42] (03CR) 10Elukey: wdqs: add availability sli recording rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [08:48:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P86292 and previous config saved to /var/cache/conftool/dbconfig/20251202-084813-marostegui.json [08:50:52] (03CR) 10Elukey: Add the sre.hosts.powercycle cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 (owner: 10Elukey) [08:52:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:52:55] (03CR) 10Elukey: [C:03+2] iPXE MBR support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1211268 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway) [08:54:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P86293 and previous config saved to /var/cache/conftool/dbconfig/20251202-085401-marostegui.json 
[08:57:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:00:05] hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T0900) [09:01:19] o/ [09:01:26] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213951 (https://phabricator.wikimedia.org/T408275) [09:01:28] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213951 (https://phabricator.wikimedia.org/T408275) (owner: 10TrainBranchBot) [09:02:02] RESOLVED: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:02:20] Parser cache is to be watched per the "risky patch" notice at https://phabricator.wikimedia.org/T408275#11421500 [09:02:27] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213951 (https://phabricator.wikimedia.org/T408275) (owner: 10TrainBranchBot) [09:02:29] (03Merged) 10jenkins-bot: iPXE MBR support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1211268 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway) [09:03:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86294 and previous config 
saved to /var/cache/conftool/dbconfig/20251202-090321-marostegui.json [09:03:26] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [09:03:26] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [09:03:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2223.codfw.wmnet with reason: Maintenance [09:03:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2223 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86295 and previous config saved to /var/cache/conftool/dbconfig/20251202-090334-marostegui.json [09:03:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:09:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86296 and previous config saved to /var/cache/conftool/dbconfig/20251202-090908-marostegui.json [09:09:13] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [09:09:14] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [09:09:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance [09:09:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1189 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86297 and previous config saved to /var/cache/conftool/dbconfig/20251202-090932-marostegui.json [09:09:48] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.5 refs T408275 [09:09:51] T408275: 
1.46.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T408275 [09:11:44] (03CR) 10Elukey: [C:03+1] Interface validators: prevent more mistakes on interface naming [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1187427 (https://phabricator.wikimedia.org/T404146) (owner: 10Ayounsi) [09:15:29] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1213600 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [09:18:42] looks quiet [09:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:21:44] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v12.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1213952 [09:24:09] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11422756 (10ayounsi) Awesome, thx! The loopbacks are also in Puppet : https://github.com/search?q=repo%3Awikimedia%2Foperations-puppet%20198.35.26.193&type=co... 
[09:27:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1213601 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [09:28:12] (03CR) 10Ayounsi: "I think all the cumin hosts are now running bookworm, so python >=3.11" [cookbooks] - 10https://gerrit.wikimedia.org/r/1161337 (owner: 10Ayounsi) [09:30:40] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v12.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1213952 (owner: 10Elukey) [09:32:17] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1213602 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [09:34:24] (03PS1) 10Elukey: Upstream release v12.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1213963 [09:34:47] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v12.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1213963 (owner: 10Elukey) [09:37:29] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1213603 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [09:38:24] !log upgrade Envoy on parsoidtest/testreduce T405808 [09:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:27] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [09:41:31] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [09:41:52] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [09:42:21] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [09:43:07] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [09:43:19] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [09:43:36] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[09:45:42] (03PS2) 10Bartosz Wójtowicz: ml-services: Update image_version for revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212536 (https://phabricator.wikimedia.org/T408538) [09:46:44] !log uploaded spicerack_12.1.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia [09:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:31] (03CR) 10Ayounsi: [C:03+2] Interface validators: prevent more mistakes on interface naming [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1187427 (https://phabricator.wikimedia.org/T404146) (owner: 10Ayounsi) [09:49:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86298 and previous config saved to /var/cache/conftool/dbconfig/20251202-094931-marostegui.json [09:49:45] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [09:49:46] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [09:50:27] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [09:50:42] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [09:51:52] (03Merged) 10jenkins-bot: Interface validators: prevent more mistakes on interface naming [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1187427 (https://phabricator.wikimedia.org/T404146) (owner: 10Ayounsi) [09:52:48] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db2223 gradually with 4 steps - After switchover [09:53:07] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db2223 gradually with 4 steps - After switchover [09:53:20] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db2223 gradually with 4 steps - After switchover [09:54:21] 
(03CR) 10AikoChou: [C:03+1] ml-services: Update image_version for revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212536 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [09:55:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2192.codfw.wmnet with reason: Maintenance [09:58:07] (03PS1) 10Kosta Harlan: UserInfoCard: Hide activity graph when it's likely to be inaccurate [extensions/CheckUser] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1213970 (https://phabricator.wikimedia.org/T400409) [10:01:57] (03CR) 10Hnowlan: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1212089 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec) [10:03:22] (03PS1) 10Dpogorzelski: ml-build: define new machine name/type [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778) [10:04:37] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [10:04:43] jouncebot: nowandnext [10:04:44] For the next 0 hour(s) and 55 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T0900) [10:04:44] In 0 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T1100) [10:05:09] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [10:07:11] hashar: Dreamy_Jazz and I will backport a patch and private settings change, if you're done with the train deploy? 
[10:10:16] (03PS2) 10Dpogorzelski: ml-build: define new machine name/type [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778) [10:10:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213949 (https://phabricator.wikimedia.org/T410303) (owner: 10Kosta Harlan) [10:11:03] (03CR) 10Klausman: ml-build: define new machine name/type (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778) (owner: 10Dpogorzelski) [10:12:32] (03Merged) 10jenkins-bot: Allow similar signals to be merged into an existing case [extensions/CheckUser] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213949 (https://phabricator.wikimedia.org/T410303) (owner: 10Kosta Harlan) [10:12:46] (03CR) 10Dpogorzelski: ml-build: define new machine name/type (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778) (owner: 10Dpogorzelski) [10:13:08] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1213949|Allow similar signals to be merged into an existing case (T410303)]] [10:14:25] (03PS1) 10Btullis: Add a role and a rolebinding to allow spark driver pods to function [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213973 [10:15:18] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1213949|Allow similar signals to be merged into an existing case (T410303)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:16:23] (03CR) 10Hnowlan: [C:03+1] rest gateway: do not rate limit internal traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213504 (https://phabricator.wikimedia.org/T410143) (owner: 10Daniel Kinzler) [10:17:01] !log kharlan@deploy2002 kharlan: Continuing with sync [10:20:53] 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#11422918 (10tappof) [10:21:00] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1213949|Allow similar signals to be merged into an existing case (T410303)]] (duration: 07m 52s) [10:22:50] (03CR) 10Brouberol: Add a role and a rolebinding to allow spark driver pods to function (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213973 (owner: 10Btullis) [10:22:56] (03PS1) 10Ayounsi: Interface validator: need to allow QSFP ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1213974 [10:23:25] (03PS2) 10Ayounsi: Interface validator: need to allow QSFP ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1213974 [10:23:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2223 gradually with 4 steps - After switchover [10:23:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1213970 (https://phabricator.wikimedia.org/T400409) (owner: 10Kosta Harlan) [10:25:12] (03Merged) 10jenkins-bot: UserInfoCard: Hide activity graph when it's likely to be inaccurate [extensions/CheckUser] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1213970 (https://phabricator.wikimedia.org/T400409) (owner: 10Kosta Harlan) [10:25:43] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1213970|UserInfoCard: Hide activity graph when it's likely to be inaccurate (T400409)]] [10:25:47] T400409: UserInfoCard: 
Sometimes the activity graph shows that there are no edits until a certain date - https://phabricator.wikimedia.org/T400409 [10:26:42] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update image_version for revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212536 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [10:26:51] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1213974 (owner: 10Ayounsi) [10:27:49] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1213970|UserInfoCard: Hide activity graph when it's likely to be inaccurate (T400409)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:28:29] (03Merged) 10jenkins-bot: ml-services: Update image_version for revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212536 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [10:29:24] (03CR) 10Elukey: ml-build: define new machine name/type (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778) (owner: 10Dpogorzelski) [10:29:42] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [10:30:10] (03CR) 10Ayounsi: [C:03+2] Interface validator: need to allow QSFP ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1213974 (owner: 10Ayounsi) [10:31:48] (03Merged) 10jenkins-bot: Interface validator: need to allow QSFP ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1213974 (owner: 10Ayounsi) [10:31:56] !log bwojtowicz@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . 
[10:32:11] !log kharlan@deploy2002 kharlan: Continuing with sync [10:32:36] (03CR) 10Elukey: [C:03+1] sre.hosts.provision: (Dell) disable LLDP on Broadcom NICs (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) (owner: 10Ayounsi) [10:32:37] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [10:33:09] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [10:33:55] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [10:35:06] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [10:36:09] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1213970|UserInfoCard: Hide activity graph when it's likely to be inaccurate (T400409)]] (duration: 10m 26s) [10:36:12] T400409: UserInfoCard: Sometimes the activity graph shows that there are no edits until a certain date - https://phabricator.wikimedia.org/T400409 [10:36:32] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host sretest1005.eqiad.wmnet [10:36:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:38:42] !log upgrade spicerack to 12.1.0 on all cumin hosts [10:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:13] (03CR) 10Kosta Harlan: [C:04-2] "Hold until December 8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211660 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [10:41:32] !log bwojtowicz@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 
'revise-tone-task-generator' for release 'main' . [10:42:02] (03CR) 10Elukey: [C:03+1] sre.network.tls: add timeout to get_server_certificate [cookbooks] - 10https://gerrit.wikimedia.org/r/1161337 (owner: 10Ayounsi) [10:42:47] (03PS3) 10Ayounsi: sre.hosts.provision: (Dell) disable LLDP on Broadcom NICs [cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) [10:43:17] (03CR) 10Ayounsi: "Thanks for the feedback, now I need to test it." [cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) (owner: 10Ayounsi) [10:47:34] Dreamy_Jazz is syncing private code changes now [10:49:44] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on bast2003 - https://phabricator.wikimedia.org/T410195#11423030 (10MoritzMuehlenhoff) >>! In T410195#11420424, @Jhancock.wm wrote: > @MoritzMuehlenhoff thanks for the help and correction. 22353BB15C0C has been replaced. Thanks. > They are hot-swappable. Its a... [10:51:45] !log rebuild software raid following disk swap on bast2003 T410195 [10:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:48] T410195: Degraded RAID on bast2003 - https://phabricator.wikimedia.org/T410195 [10:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T1100) [11:01:46] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [11:02:07] !incidents [11:02:08] 7077 (UNACKED) Primary outbound port utilisation over 80% (paged) network 
noc (cr1-codfw.wikimedia.org) [11:02:08] 7072 (RESOLVED) [2x] TransitPeeringTransportOutboundSaturation network sre (gnmi) [11:02:08] 7074 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [11:02:08] 7073 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [11:02:08] 7075 (RESOLVED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: cr2-eqiad:xe-1/0/1:1 {#180180823000240:1} xe-1/0/1:1 gnmi eqiad) [11:02:23] checking [11:02:45] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr3-eqsin.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [11:02:50] eqsin transport [11:03:39] godog: false positive, librenms still thinks it's a 6G circuit [11:03:43] do you all need the MW infrastructure window, or can we continue with MW backports? [11:04:04] XioNoX: hah! good news [11:04:35] kostajh: seems like we're fine, monitoring artifact as opposed to an issue cc XioNoX [11:05:15] ok, I'll go ahead then. 
[11:06:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211659 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [11:06:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211658 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [11:06:46] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [11:07:18] (03Merged) 10jenkins-bot: hCaptcha: Enable hCaptcha editing in 100% passive mode on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211658 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [11:07:20] (03Merged) 10jenkins-bot: hCaptcha: Switch frwiki to 99.9% passive mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211659 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [11:07:41] the other side should recover soon as well [11:07:46] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr3-eqsin.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [11:07:54] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1211659|hCaptcha: Switch frwiki to 99.9% passive mode (T405586)]], [[gerrit:1211658|hCaptcha: Enable hCaptcha editing in 100% passive mode on enwiki (T405586)]] [11:07:56] very efficient alerting system [11:07:57] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [11:10:02] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1211659|hCaptcha: Switch frwiki to 99.9% passive mode 
(T405586)]], [[gerrit:1211658|hCaptcha: Enable hCaptcha editing in 100% passive mode on enwiki (T405586)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:12:48] (03CR) 10Ayounsi: [C:03+2] sre.network.tls: add timeout to get_server_certificate [cookbooks] - 10https://gerrit.wikimedia.org/r/1161337 (owner: 10Ayounsi) [11:12:51] !log kharlan@deploy2002 kharlan: Continuing with sync [11:13:26] (03PS1) 10Urbanecm: [Growth] Enable Add Link for 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213988 (https://phabricator.wikimedia.org/T407818) [11:13:35] (03PS2) 10Btullis: Add an RBAC configuration to allow spark driver pods to function [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213973 [11:14:04] (03CR) 10Btullis: Add an RBAC configuration to allow spark driver pods to function (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213973 (owner: 10Btullis) [11:16:49] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211659|hCaptcha: Switch frwiki to 99.9% passive mode (T405586)]], [[gerrit:1211658|hCaptcha: Enable hCaptcha editing in 100% passive mode on enwiki (T405586)]] (duration: 08m 55s) [11:16:52] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [11:26:08] (03PS1) 10Kosta Harlan: wgAutoConfirmCount: Raise value to 10 for frwiki, idwiki, trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213995 (https://phabricator.wikimedia.org/T411263) [11:26:33] I have one more config patch, as long as that isn't disrupting anyone else [11:27:06] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11423123 (10Aklapper) The data provided in this task is partially outdated so I wonder about the user journey / where that came from? The form to use nowa... 
[11:30:01] (03CR) 10Dreamy Jazz: [C:03+1] "This does have some side effects including:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213995 (https://phabricator.wikimedia.org/T411263) (owner: 10Kosta Harlan) [11:30:11] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [11:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:31:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213995 (https://phabricator.wikimedia.org/T411263) (owner: 10Kosta Harlan) [11:32:06] (03Merged) 10jenkins-bot: wgAutoConfirmCount: Raise value to 10 for frwiki, idwiki, trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213995 (https://phabricator.wikimedia.org/T411263) (owner: 10Kosta Harlan) [11:32:14] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: do not rate limit internal traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213504 (https://phabricator.wikimedia.org/T410143) (owner: 10Daniel Kinzler) [11:32:34] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1213995|wgAutoConfirmCount: Raise value to 10 for frwiki, idwiki, trwiki (T411263)]] [11:32:37] T411263: hCaptcha: Raise wgAutoConfirmCount to 10 for frwiki, idwiki, trwiki - https://phabricator.wikimedia.org/T411263 [11:32:39] (03CR) 10Btullis: [C:03+2] Add an RBAC configuration to allow spark driver pods to function [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213973 (owner: 10Btullis) [11:33:28] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, 
one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1213587 (owner: 10Dzahn) [11:34:14] (03Merged) 10jenkins-bot: rest gateway: do not rate limit internal traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213504 (https://phabricator.wikimedia.org/T410143) (owner: 10Daniel Kinzler) [11:34:45] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1213995|wgAutoConfirmCount: Raise value to 10 for frwiki, idwiki, trwiki (T411263)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:35:08] (03Merged) 10jenkins-bot: Add an RBAC configuration to allow spark driver pods to function [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213973 (owner: 10Btullis) [11:35:30] kostajh: hey, I'm about to deploy an update to deployment-charts. should I wait until you are done? [11:35:42] duesen: I'll be done in a few minutes [11:35:49] (03CR) 10Michael Große: [C:03+1] [Growth] Enable Add Link for 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213988 (https://phabricator.wikimedia.org/T407818) (owner: 10Urbanecm) [11:35:50] !log kharlan@deploy2002 kharlan: Continuing with sync [11:35:58] kostajh: ok thanks [11:36:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [11:36:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2149 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86302 and previous config saved to /var/cache/conftool/dbconfig/20251202-113625-marostegui.json [11:36:30] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [11:36:31] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [11:37:17] !log rebuild RAID on ms-fe2014 T410959 [11:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:20] T410959: 
Degraded RAID on ms-fe2014 - https://phabricator.wikimedia.org/T410959 [11:40:00] (03CR) 10Urbanecm: [C:04-1] "issue: `wgGEUseMetricsPlatformExtension` seems to be false for ptwiki, which make it impossible to use the feature" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208357 (https://phabricator.wikimedia.org/T409606) (owner: 10Michael Große) [11:41:02] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1213995|wgAutoConfirmCount: Raise value to 10 for frwiki, idwiki, trwiki (T411263)]] (duration: 08m 28s) [11:41:05] T411263: hCaptcha: Raise wgAutoConfirmCount to 10 for frwiki, idwiki, trwiki - https://phabricator.wikimedia.org/T411263 [11:41:56] duesen: over to you, although I'll have another private code change to make when you're done [11:42:18] kostajh: thanks. I'm waiting for my patch to propagate to the deployment server [11:44:15] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:44:58] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:46:17] (03PS3) 10Michael Große: Growth: Enable Revise Tone feature on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208357 (https://phabricator.wikimedia.org/T409606) [11:46:17] (03CR) 10Michael Große: Growth: Enable Revise Tone feature on pilot wikis (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208357 (https://phabricator.wikimedia.org/T409606) (owner: 10Michael Große) [11:46:29] (03PS4) 10Michael Große: Growth: Enable Revise Tone feature on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208357 (https://phabricator.wikimedia.org/T409606) [11:46:51] (03PS1) 10Federico Ceratto: production-m2.sql.erb: Add requestctl database grants [puppet] - 10https://gerrit.wikimedia.org/r/1214002 (https://phabricator.wikimedia.org/T411111) [11:48:39] (03CR) 10CI reject: [V:04-1] production-m2.sql.erb: Add requestctl database grants [puppet] - 
10https://gerrit.wikimedia.org/r/1214002 (https://phabricator.wikimedia.org/T411111) (owner: 10Federico Ceratto) [11:49:47] duesen: please let me know when you're done [11:50:14] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:50:50] (03CR) 10Muehlenhoff: [C:03+1] "Looks great!" [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [11:51:37] (03CR) 10Urbanecm: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208357 (https://phabricator.wikimedia.org/T409606) (owner: 10Michael Große) [11:52:02] (03CR) 10Urbanecm: [C:03+1] "Hmm... We should make it harder to accidentally deploy stuff for everyone. Is there a reason to not switch all wikis to xlab already?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208357 (https://phabricator.wikimedia.org/T409606) (owner: 10Michael Große) [11:52:53] (03PS2) 10Federico Ceratto: production-m2.sql.erb: Add requestctl database grants [puppet] - 10https://gerrit.wikimedia.org/r/1214002 (https://phabricator.wikimedia.org/T411111) [11:54:43] (03CR) 10Marostegui: production-m2.sql.erb: Add requestctl database grants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214002 (https://phabricator.wikimedia.org/T411111) (owner: 10Federico Ceratto) [11:56:15] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:57:04] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:57:19] kostajh: if all goes well i'm done in about 10 minutes [12:00:33] (03CR) 10Marostegui: [C:04-1] "You also need to add the codfw proxy and grants there." 
[puppet] - 10https://gerrit.wikimedia.org/r/1214002 (https://phabricator.wikimedia.org/T411111) (owner: 10Federico Ceratto) [12:00:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T410589)', diff saved to https://phabricator.wikimedia.org/P86303 and previous config saved to /var/cache/conftool/dbconfig/20251202-120046-ladsgroup.json [12:00:50] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [12:02:06] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:02:06] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:02:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11423243 (10cmooney) >>! In T405499#11411590, @VRiley-WMF wrote: > Hey @cmooney It has been reused for that purpose, however it's still being worked on... 
[12:04:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208357 (https://phabricator.wikimedia.org/T409606) (owner: 10Michael Große) [12:04:23] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:04:47] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:04:49] (03Abandoned) 10Kosta Harlan: hCaptcha: Disable hCaptcha for projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187740 (owner: 10Kosta Harlan) [12:05:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ssw1-d8-eqiad cross-rack links incorrect in Netbox - https://phabricator.wikimedia.org/T411480 (10cmooney) 03NEW p:05Triage→03Medium [12:07:10] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:09:07] (03PS1) 10Volans: wmcs infra-tracing: simplify Loki indexing [puppet] - 10https://gerrit.wikimedia.org/r/1214012 (https://phabricator.wikimedia.org/T399313) [12:10:57] (03CR) 10CI reject: [V:04-1] wmcs infra-tracing: simplify Loki indexing [puppet] - 10https://gerrit.wikimedia.org/r/1214012 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [12:11:35] (03PS2) 10Volans: wmcs infra-tracing: simplify Loki indexing [puppet] - 10https://gerrit.wikimedia.org/r/1214012 (https://phabricator.wikimedia.org/T399313) [12:11:56] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:11:56] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:13:11] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and key verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1213540 (https://phabricator.wikimedia.org/T411404) (owner: 10Kamila Součková) [12:14:45] (03CR) 10Volans: "Tested on one host on toolsbeta" [puppet] - 10https://gerrit.wikimedia.org/r/1214012 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [12:14:45] kostajh: i'm done, go ahead! [12:15:02] duesen: thanks! 
[12:15:04] jouncebot: nowandnext [12:15:04] No deployments scheduled for the next 0 hour(s) and 44 minute(s) [12:15:05] In 0 hour(s) and 44 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T1300) [12:15:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P86304 and previous config saved to /var/cache/conftool/dbconfig/20251202-121554-ladsgroup.json [12:16:17] (03PS1) 10Kosta Harlan: SI: Skip successfuledit event for null edits [extensions/CheckUser] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214016 (https://phabricator.wikimedia.org/T410280) [12:16:30] (03PS1) 10Kosta Harlan: SI: Skip successfuledit event for null edits [extensions/CheckUser] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214017 (https://phabricator.wikimedia.org/T410280) [12:16:31] (03PS3) 10Federico Ceratto: production-m2.sql.erb: Add requestctl database grants [puppet] - 10https://gerrit.wikimedia.org/r/1214002 (https://phabricator.wikimedia.org/T411111) [12:17:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214017 (https://phabricator.wikimedia.org/T410280) (owner: 10Kosta Harlan) [12:17:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214016 (https://phabricator.wikimedia.org/T410280) (owner: 10Kosta Harlan) [12:17:25] (03PS4) 10Federico Ceratto: production-m2.sql.erb: Add requestctl database grants [puppet] - 10https://gerrit.wikimedia.org/r/1214002 (https://phabricator.wikimedia.org/T411111) [12:17:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host db1169.eqiad.wmnet with OS bookworm [12:19:32] (03CR) 10Federico Ceratto: "I added dbproxy2006 as it's used in other grants in 
production-m2.sql.erb" [puppet] - 10https://gerrit.wikimedia.org/r/1214002 (https://phabricator.wikimedia.org/T411111) (owner: 10Federico Ceratto) [12:20:36] (03CR) 10Marostegui: [C:04-1] "Proxies are wrong, you've selected m1 proxies." [puppet] - 10https://gerrit.wikimedia.org/r/1214002 (https://phabricator.wikimedia.org/T411111) (owner: 10Federico Ceratto) [12:20:41] (03PS3) 10Ayounsi: sre.network.tls: add timeout to get_server_certificate [cookbooks] - 10https://gerrit.wikimedia.org/r/1161337 [12:30:59] (03PS1) 10Muehlenhoff: Fix EFIfied DB reuse Partman config [puppet] - 10https://gerrit.wikimedia.org/r/1214019 (https://phabricator.wikimedia.org/T410400) [12:31:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P86305 and previous config saved to /var/cache/conftool/dbconfig/20251202-123102-ladsgroup.json [12:33:14] (03Merged) 10jenkins-bot: SI: Skip successfuledit event for null edits [extensions/CheckUser] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214017 (https://phabricator.wikimedia.org/T410280) (owner: 10Kosta Harlan) [12:33:16] (03Merged) 10jenkins-bot: SI: Skip successfuledit event for null edits [extensions/CheckUser] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214016 (https://phabricator.wikimedia.org/T410280) (owner: 10Kosta Harlan) [12:33:37] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.026e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [12:33:45] (03CR) 10Muehlenhoff: [C:03+2] Fix EFIfied DB reuse Partman config [puppet] - 10https://gerrit.wikimedia.org/r/1214019 (https://phabricator.wikimedia.org/T410400) (owner: 10Muehlenhoff) [12:33:49] !log kharlan@deploy2002 Started scap 
sync-world: Backport for [[gerrit:1214017|SI: Skip successfuledit event for null edits (T410280)]], [[gerrit:1214016|SI: Skip successfuledit event for null edits (T410280)]] [12:35:56] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1214017|SI: Skip successfuledit event for null edits (T410280)]], [[gerrit:1214016|SI: Skip successfuledit event for null edits (T410280)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:36:26] !log kharlan@deploy2002 kharlan: Continuing with sync [12:37:52] (03PS5) 10Federico Ceratto: production-m2.sql.erb: Add requestctl database grants [puppet] - 10https://gerrit.wikimedia.org/r/1214002 (https://phabricator.wikimedia.org/T411111) [12:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:40:18] (03CR) 10Federico Ceratto: "I updated the grants." 
[puppet] - 10https://gerrit.wikimedia.org/r/1214002 (https://phabricator.wikimedia.org/T411111) (owner: 10Federico Ceratto) [12:40:27] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214017|SI: Skip successfuledit event for null edits (T410280)]], [[gerrit:1214016|SI: Skip successfuledit event for null edits (T410280)]] (duration: 06m 39s) [12:40:29] (03CR) 10Marostegui: [C:03+1] production-m2.sql.erb: Add requestctl database grants [puppet] - 10https://gerrit.wikimedia.org/r/1214002 (https://phabricator.wikimedia.org/T411111) (owner: 10Federico Ceratto) [12:41:38] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1169.eqiad.wmnet with OS bookworm [12:42:38] (03PS1) 10Btullis: Remove unnecessary and/or incorrect hadoop/spark config options [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214020 (https://phabricator.wikimedia.org/T410017) [12:43:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host db1169.eqiad.wmnet with OS bookworm [12:46:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T410589)', diff saved to https://phabricator.wikimedia.org/P86307 and previous config saved to /var/cache/conftool/dbconfig/20251202-124609-ladsgroup.json [12:46:13] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [12:46:25] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [12:46:32] (03PS1) 10Btullis: Sort the hadoop/spark config items alphabetically [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214021 (https://phabricator.wikimedia.org/T406833) [12:46:33] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2193 (T410589)', diff saved to https://phabricator.wikimedia.org/P86308 and previous config saved to /var/cache/conftool/dbconfig/20251202-124632-ladsgroup.json [12:51:09] 10ops-ulsfo, 06SRE, 06DC-Ops, 
06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11423398 (10cmooney) @papaul as @ayounsi mentions you need to change it in puppet where it is also. Principally to change what IPs the hosts doing BGP are goi... [12:53:55] (03CR) 10Filippo Giunchedi: [C:03+1] "Neat" [puppet] - 10https://gerrit.wikimedia.org/r/1214012 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [12:55:04] (03PS1) 10Muehlenhoff: Deprecate sessionstore-roots [puppet] - 10https://gerrit.wikimedia.org/r/1214022 (https://phabricator.wikimedia.org/T276465) [12:56:26] (03PS3) 10Dpogorzelski: ml-build: define new machine name/type [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778) [12:56:44] (03CR) 10Dpogorzelski: ml-build: define new machine name/type (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778) (owner: 10Dpogorzelski) [12:56:56] (03CR) 10Brouberol: [C:03+1] "Is there anything we need to backport to the airflow chart itself?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1214020 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [12:57:11] (03CR) 10Brouberol: [C:03+1] Sort the hadoop/spark config items alphabetically [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214021 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [12:59:12] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1169.eqiad.wmnet with reason: host reimage [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T1300) [13:03:08] (03CR) 10Btullis: [C:03+2] Remove unnecessary and/or incorrect hadoop/spark config options [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214020 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [13:03:11] (03CR) 10Btullis: [C:03+2] Sort the hadoop/spark config items alphabetically [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214021 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [13:03:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: host reimage [13:04:59] (03Merged) 10jenkins-bot: Remove unnecessary and/or incorrect hadoop/spark config options [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214020 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [13:05:01] (03Merged) 10jenkins-bot: Sort the hadoop/spark config items alphabetically [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214021 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [13:06:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:07:31] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/analytics-test: apply [13:07:38] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [13:11:04] (03CR) 10Volans: [C:03+2] wmcs infra-tracing: simplify Loki indexing [puppet] - 10https://gerrit.wikimedia.org/r/1214012 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [13:15:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:17:11] ^ me, cf #-sre. I had put a pre-emptive silence in alertmanager, but I obviously messed up. Sorry [13:18:13] I copied an old silence, changed the cluster name but not the site. My bad. [13:20:07] (03PS2) 10Kgraessle: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207932 (https://phabricator.wikimedia.org/T409438) [13:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:20:45] (03PS4) 10Kgraessle: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207923 (https://phabricator.wikimedia.org/T409438) [13:21:43] (03PS3) 10Kgraessle: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207932 (https://phabricator.wikimedia.org/T409438) [13:25:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1169.eqiad.wmnet with OS bookworm [13:25:58] (03CR) 10Jelto: [C:03+1] gerrit: 
remove firewall rule to accept Wikimania traffic [puppet] - 10https://gerrit.wikimedia.org/r/1201793 (owner: 10Dzahn) [13:31:38] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 8082 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [13:34:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [13:39:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:41:41] 06SRE, 06collaboration-services, 10vrts, 10Znuny, 07Wikimedia-Incident: No space left on device on VRTS host - https://phabricator.wikimedia.org/T411452#11423679 (10Arnoldokoth) @Dzahn Yes, I disabled that timer because it conflicted with another one run by the built-in VRTS daemon. Running both of them... 
[13:44:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:45:48] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad row C/D servers need to boot/reimage in UEFI mode - https://phabricator.wikimedia.org/T410910#11423686 (10cmooney) 05Open→03Resolved Thanks to the awesome work of @jhathaway this is no longer a requirement. We can use `--no82` with a host in BIOS... [13:46:13] (03PS1) 10AOkoth: vrts: add high inode usage alert [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) [13:47:18] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214035 [13:47:19] (03PS1) 10Sbisson: CX3 Build 1.0.0+20251201 [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214036 (https://phabricator.wikimedia.org/T408842) [13:47:49] (03CR) 10CI reject: [V:04-1] vrts: add high inode usage alert [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth) [13:49:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:49:05] (03PS2) 10AOkoth: vrts: add high inode usage alert [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) [13:49:44] (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1214035 (owner: 10PipelineBot) [13:50:16] (03CR) 10CI reject: [V:04-1] vrts: add high inode usage alert [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth) [13:51:36] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214035 (owner: 10PipelineBot) [13:52:06] (03PS1) 10D3r1ck01: user: Mark users created with User::addToDatabase() as primary [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214041 (https://phabricator.wikimedia.org/T410652) [13:52:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214041 (https://phabricator.wikimedia.org/T410652) (owner: 10D3r1ck01) [13:54:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:54:07] (03PS3) 10AOkoth: vrts: add high inode usage alert [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) [13:54:27] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:55:00] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:55:19] (03CR) 10CI reject: [V:04-1] vrts: add high inode usage alert [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth) [13:56:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T411163 T411164)', diff saved to 
https://phabricator.wikimedia.org/P86309 and previous config saved to /var/cache/conftool/dbconfig/20251202-135600-marostegui.json [13:56:05] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [13:56:06] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [13:56:11] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [13:56:33] (03PS4) 10AOkoth: vrts: add high inode usage alert [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) [13:56:42] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [13:56:51] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [13:56:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207932 (https://phabricator.wikimedia.org/T409438) (owner: 10Kgraessle) [13:57:19] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [13:58:16] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db1251.eqiad.wmnet onto db1169.eqiad.wmnet [13:58:20] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db1251 - Depool db1251.eqiad.wmnet to then clone it to db1169.eqiad.wmnet - marostegui@cumin1003 [13:58:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1251 - Depool db1251.eqiad.wmnet to then clone it to db1169.eqiad.wmnet - marostegui@cumin1003 [13:59:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T1400) [14:00:05] MichaelG_WMF and xSavitar: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:32] o/ [14:00:36] * MichaelG_WMF is here [14:00:50] MichaelG_WMF, I'll self service after you're done :) [14:01:03] o/ [14:02:00] I can deploy [14:02:13] xSavitar, you go ahead [14:02:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207932 (https://phabricator.wikimedia.org/T409438) (owner: 10Kgraessle) [14:02:15] MichaelG_WMF: looking at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1208357/4/wmf-config/ext-GrowthExperiments.php, is it intentional that wgGEReviseToneSuggestedEditEnabled is missing eswiki [14:02:22] (compared to wgGEUseMetricsPlatformExtension) [14:02:26] ? [14:02:36] Lucas_WMDE, okay re deployment [14:03:23] Yes, it is. We are not intending to deploy to eswiki for this. In general the plan is to make `wgGEUseMetricsPlatformExtension` default to `true`, but that will be a separate change, not today. [14:03:30] ok, just wanted to check [14:03:34] @Lucas_WMDE Thanks for checking though! 
[14:03:43] then let’s start with that
[14:03:55] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1208357 (https://phabricator.wikimedia.org/T409606) (owner: Michael Große)
[14:04:53] (PS1) Marostegui: installserver: Place holder for DBs with uefi [puppet] - https://gerrit.wikimedia.org/r/1214050
[14:05:02] (Merged) jenkins-bot: Growth: Enable Revise Tone feature on pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1208357 (https://phabricator.wikimedia.org/T409606) (owner: Michael Große)
[14:05:38] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1208357|Growth: Enable Revise Tone feature on pilot wikis (T409606)]]
[14:05:41] T409606: Release Revise Tone to pilot Wikipedias for opt-in testing (disabled by default) - https://phabricator.wikimedia.org/T409606
[14:06:05] (PS2) Marostegui: installserver: Place holder for DBs with uefi [puppet] - https://gerrit.wikimedia.org/r/1214050
[14:06:57] * urbanecm is here
[14:07:35] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1213988 (https://phabricator.wikimedia.org/T407818) (owner: Urbanecm)
[14:07:46] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, migr: Backport for [[gerrit:1208357|Growth: Enable Revise Tone feature on pilot wikis (T409606)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:08:02] (CR) Marostegui: [C:+2] installserver: Place holder for DBs with uefi [puppet] - https://gerrit.wikimedia.org/r/1214050 (owner: Marostegui)
[14:08:18] MichaelG_WMF: anything to test?
[14:08:36] @Lucas_WMDE Yes!
let me quickly check the wikis [14:09:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:09:16] Lucas_WMDE: i very boldly added a patch. happy to self-service whenever convenient. [14:09:37] (03PS1) 10Brouberol: airflow: fix the airflow-devenv destroy command [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214053 (https://phabricator.wikimedia.org/T411499) [14:11:07] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ganeti-test2001.codfw.wmnet with reason: test CR1207804 [14:11:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P86311 and previous config saved to /var/cache/conftool/dbconfig/20251202-141108-marostegui.json [14:11:10] (03CR) 10Btullis: [C:03+1] airflow: fix the airflow-devenv destroy command [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214053 (https://phabricator.wikimedia.org/T411499) (owner: 10Brouberol) [14:11:48] !log ayounsi@cumin1003 START - Cookbook sre.hosts.provision for host ganeti-test2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:12:12] (03CR) 10Brouberol: [C:03+2] airflow: fix the airflow-devenv destroy command [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214053 (https://phabricator.wikimedia.org/T411499) (owner: 10Brouberol) [14:12:36] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-test2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:12:46] urbanecm: *insert lolcat here* [14:12:50] “I’m in 
urwiki”
[14:12:52] “enabling urlink”
[14:13:07] * urbanecm lacks a cat
[14:13:17] @Lucas_WMDE Looks good on the wikis!
[14:13:23] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, migr: Continuing with sync
[14:13:25] alright, thanks!
[14:13:38] urbanecm: so does the wiki, reportedly https://bash.toolforge.org/quip/AU7VU7Zh6snAnmqnK_td
[14:14:02] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:14:31] !log ayounsi@cumin1003 START - Cookbook sre.hosts.provision for host ganeti-test2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:14:31] (PS1) Elukey: sre.hosts.provision: make UEFI mandatory [cookbooks] - https://gerrit.wikimedia.org/r/1214055
[14:16:19] (PS1) Zabe: Close crwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1214056 (https://phabricator.wikimedia.org/T411501)
[14:16:20] (PS1) Zabe: Close klwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1214057
[14:16:38] (CR) Lucas Werkmeister (WMDE): [C:+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1214041 (https://phabricator.wikimedia.org/T410652) (owner: D3r1ck01)
[14:16:39] (PS2) Zabe: Close klwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1214057 (https://phabricator.wikimedia.org/T411501)
[14:18:41] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208357|Growth: Enable Revise Tone feature on pilot wikis (T409606)]] (duration: 13m 03s)
[14:18:44] T409606: Release Revise Tone to pilot Wikipedias for opt-in testing (disabled by default) -
https://phabricator.wikimedia.org/T409606 [14:19:44] zabe: speaking of closing wikis... do you plan to _create_ one anytime soon? [14:19:48] 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#11423859 (10tappof) 05Resolved→03Open Due to the issues described in {T410152}, reverting the patch https://gerrit.wikimedia.org/r/1184566... [14:19:49] (03PS5) 10AOkoth: vrts: add high inode usage alert [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) [14:19:51] * urbanecm would love to test something on a very fresh wiki [14:20:00] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [14:20:20] xSavitar: over to you [14:20:32] Lucas_WMDE, thanks! [14:20:45] urbanecm: tokwiki not fresh enough? 
[14:21:01] (03Merged) 10jenkins-bot: user: Mark users created with User::addToDatabase() as primary [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214041 (https://phabricator.wikimedia.org/T410652) (owner: 10D3r1ck01) [14:21:03] (03CR) 10CI reject: [V:04-1] vrts: add high inode usage alert [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth) [14:21:03] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-test2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:21:08] urbanecm: there is currently none requested by langcom, not sure if they have something in the pipeline anytime soon [14:21:39] !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1214041|user: Mark users created with User::addToDatabase() as primary (T410652)]] [14:21:44] T410652: RuntimeException: CAS update failed on user_touched. The version of the user to be saved is older than the current version. - https://phabricator.wikimedia.org/T410652 [14:25:24] !log derick@deploy2002 d3r1ck01, derick: Backport for [[gerrit:1214041|user: Mark users created with User::addToDatabase() as primary (T410652)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:25:52] zabe: sounds good. please do feel free to ping me whenever that happens. [14:26:13] !log derick@deploy2002 d3r1ck01, derick: Continuing with sync [14:26:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P86312 and previous config saved to /var/cache/conftool/dbconfig/20251202-142616-marostegui.json [14:26:24] Lucas_WMDE: unfortunately no. 
i need a wiki that doesn't have the post-creation checklist done [14:26:34] sorry I was too fast [14:26:40] (i want to make GrowthExperiments enabled on Wikipedias by default, but I don't know what will break if I do so) [14:28:03] !log ayounsi@cumin1003 START - Cookbook sre.hosts.provision for host ganeti-test2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:28:47] urbanecm: ok, just checking :) [14:28:56] (03Abandoned) 10Ayounsi: sre.hosts.provision: make UEFI opt-out [cookbooks] - 10https://gerrit.wikimedia.org/r/1078539 (owner: 10Ayounsi) [14:28:57] * Lucas_WMDE scolds taavi for being too fast [14:29:02] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:29:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:30:00] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [14:30:12] taavi: no worries! being quick is helpful most of the time [14:30:13] !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214041|user: Mark users created with User::addToDatabase() as primary (T410652)]] (duration: 08m 34s) [14:30:16] T410652: RuntimeException: CAS update failed on user_touched. The version of the user to be saved is older than the current version. 
- https://phabricator.wikimedia.org/T410652
[14:30:18] this is not time sensitive though :)
[14:30:23] xSavitar: let me know when you're done!
[14:30:27] urbanecm, over to you.
[14:30:40] (CR) Ayounsi: [C:+1] sre.hosts.provision: make UEFI mandatory [cookbooks] - https://gerrit.wikimedia.org/r/1214055 (owner: Elukey)
[14:32:58] urbanecm: sure will do, will try to not forget :)
[14:33:55] (PS1) C. Scott Ananian: WIP: parsoid update [vendor] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1214061
[14:34:56] ty
[14:35:04] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-test2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:37:22] (PS2) Urbanecm: [Growth] Enable Add Link for 3 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1213988 (https://phabricator.wikimedia.org/T407818)
[14:37:27] (CR) Urbanecm: [C:+2] [Growth] Enable Add Link for 3 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1213988 (https://phabricator.wikimedia.org/T407818) (owner: Urbanecm)
[14:38:10] (Merged) jenkins-bot: [Growth] Enable Add Link for 3 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1213988 (https://phabricator.wikimedia.org/T407818) (owner: Urbanecm)
[14:38:50] (PS16) Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1198928
[14:39:06] (CR) Elukey: Add the sre.hosts.powercycle cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1198928 (owner: Elukey)
[14:39:42] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1213988|[Growth] Enable Add Link for 3 wikis (T407818)]]
[14:39:45] T407818: Add a Link: Rollout "Add a Link" Structured Task to Chinese, Japanese, & Urdu Wikipedias - https://phabricator.wikimedia.org/T407818
[14:39:53] (CR) Muehlenhoff: "Mandatory is the wrong wording, should be "Make UEFI the default"?"
[cookbooks] - 10https://gerrit.wikimedia.org/r/1214055 (owner: 10Elukey) [14:40:54] (03PS2) 10Elukey: sre.hosts.provision: make UEFI default [cookbooks] - 10https://gerrit.wikimedia.org/r/1214055 [14:41:08] (03CR) 10Elukey: "Sure, fixed the commit msg." [cookbooks] - 10https://gerrit.wikimedia.org/r/1214055 (owner: 10Elukey) [14:41:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86313 and previous config saved to /var/cache/conftool/dbconfig/20251202-144123-marostegui.json [14:41:28] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [14:41:29] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [14:41:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [14:41:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1198 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86314 and previous config saved to /var/cache/conftool/dbconfig/20251202-144148-marostegui.json [14:41:50] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1213988|[Growth] Enable Add Link for 3 wikis (T407818)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:43:29] !log urbanecm@deploy2002 urbanecm: Continuing with sync [14:45:24] 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new JBOD disk controllers into SM swift backends - https://phabricator.wikimedia.org/T400878#11423954 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon [14:47:28] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1213988|[Growth] Enable Add Link for 3 wikis (T407818)]] (duration: 07m 46s) [14:47:31] T407818: Add a Link: Rollout "Add a Link" Structured Task to Chinese, Japanese, & Urdu Wikipedias - https://phabricator.wikimedia.org/T407818 [14:47:37] (03PS1) 10Isabelle Hurbain-Palatin: Bump parsoid to v0.23.0-a7.1 on wmf.4 [vendor] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214069 (https://phabricator.wikimedia.org/T411238) [14:48:12] * urbanecm done [14:48:31] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1214055 (owner: 10Elukey) [14:51:00] !log UTC afternoon backport+config window done [14:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:53] xSavitar: FWIW, logspam-watch doesn’t seem to be showing any reduction in “CAS update failed on user_touched” so far… [14:52:11] (03PS1) 10Isabelle Hurbain-Palatin: Bump parsoid to v0.23.0-a7.1 on wmf.4 [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214070 (https://phabricator.wikimedia.org/T411238) [14:54:33] (03CR) 10C. Scott Ananian: [C:03+2] Bump parsoid to v0.23.0-a7.1 on wmf.4 [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214070 (https://phabricator.wikimedia.org/T411238) (owner: 10Isabelle Hurbain-Palatin) [14:55:09] (03CR) 10C. 
Scott Ananian: [C:03+2] Bump parsoid to v0.23.0-a7.1 on wmf.4 [vendor] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214069 (https://phabricator.wikimedia.org/T411238) (owner: 10Isabelle Hurbain-Palatin) [14:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:55:58] (03PS10) 10AOkoth: vrts: add high inode usage alert [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) [14:56:58] (03PS1) 10Clément Goubert: maintenance::campaignevents: Debug aggregateanswers [puppet] - 10https://gerrit.wikimedia.org/r/1214071 (https://phabricator.wikimedia.org/T411417) [14:58:47] (03CR) 10Ssingh: "@vgutierrez@wikimedia.org: thoughts on this? The base class method essentially, adjusted for our use." [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 (owner: 10CDobbins) [14:58:48] 06SRE, 10SRE-Access-Requests: Requesting update of SSH key for zoe - https://phabricator.wikimedia.org/T411506 (10zoe) 03NEW [14:58:53] (03CR) 10Federico Ceratto: [C:03+2] production-m2.sql.erb: Add requestctl database grants [puppet] - 10https://gerrit.wikimedia.org/r/1214002 (https://phabricator.wikimedia.org/T411111) (owner: 10Federico Ceratto) [14:59:22] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214072 [15:00:05] Deploy window Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T1500) [15:05:02] (03CR) 10JHathaway: [C:03+1] Add the sre.hosts.powercycle cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 (owner: 10Elukey) [15:08:07] (03CR) 10Clément Goubert: "LGTM except 2 nits, will need to test it using `test-cookbook` in a pairing session." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1212089 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec) [15:09:37] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::cloudgw: Implement new virt network config structure [puppet] - 10https://gerrit.wikimedia.org/r/1213436 (https://phabricator.wikimedia.org/T411081) (owner: 10Majavah) [15:10:00] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:48] (03Merged) 10jenkins-bot: Bump parsoid to v0.23.0-a7.1 on wmf.4 [vendor] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214069 (https://phabricator.wikimedia.org/T411238) (owner: 10Isabelle Hurbain-Palatin) [15:10:55] (03CR) 10Daimona Eaytoy: [C:03+1] maintenance::campaignevents: Debug aggregateanswers [puppet] - 10https://gerrit.wikimedia.org/r/1214071 (https://phabricator.wikimedia.org/T411417) (owner: 10Clément Goubert) [15:11:54] (03Merged) 10jenkins-bot: Bump parsoid to v0.23.0-a7.1 on wmf.4 [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214070 (https://phabricator.wikimedia.org/T411238) (owner: 10Isabelle Hurbain-Palatin) [15:12:50] !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1088.eqiad.wmnet with OS bullseye [15:13:03] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11424126 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1088.eqiad.wmnet... 
[15:13:56] !log upgrade Envoy on Turnilo T405808
[15:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:59] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808
[15:17:14] (PS4) JHathaway: ipxe MBR support [cookbooks] - https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286)
[15:17:40] (CR) Clément Goubert: [C:+1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/1212599 (https://phabricator.wikimedia.org/T410552) (owner: Blake)
[15:18:05] (CR) Vgutierrez: [C:+1] hieradata: temporarily point eqiad LVS at conf1008 [puppet] - https://gerrit.wikimedia.org/r/1213601 (https://phabricator.wikimedia.org/T352245) (owner: Scott French)
[15:18:07] (PS2) Muehlenhoff: Remove udp2log-users [puppet] - https://gerrit.wikimedia.org/r/1212529
[15:19:26] (CR) Elukey: [C:+1] ipxe MBR support (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) (owner: JHathaway)
[15:19:56] SRE, SRE-Access-Requests, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11424146 (Rmaung) Hi @Aklapper -- That's my mistake, I directed @SEgt-WMF there, that comes from instructions at https://www.mediawiki.org/wiki/Product_...
[15:20:13] (03CR) 10Clément Goubert: [C:03+2] maintenance::campaignevents: Debug aggregateanswers [puppet] - 10https://gerrit.wikimedia.org/r/1214071 (https://phabricator.wikimedia.org/T411417) (owner: 10Clément Goubert) [15:20:28] (03CR) 10Muehlenhoff: [C:03+2] Remove udp2log-users [puppet] - 10https://gerrit.wikimedia.org/r/1212529 (owner: 10Muehlenhoff) [15:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:21:39] (03CR) 10JHathaway: [C:03+1] sre.hosts.provision: make UEFI default [cookbooks] - 10https://gerrit.wikimedia.org/r/1214055 (owner: 10Elukey) [15:22:55] (03PS1) 10MVernon: swift: load/drain 5 codfw backends (h/w refresh) [puppet] - 10https://gerrit.wikimedia.org/r/1214077 (https://phabricator.wikimedia.org/T404771) [15:23:54] (03PS1) 10Muehlenhoff: Add Hugh as approver for mw-log-readers and logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/1214078 (https://phabricator.wikimedia.org/T276465) [15:24:30] (03CR) 10CI reject: [V:04-1] ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway) [15:25:57] !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1088.eqiad.wmnet with reason: host reimage [15:26:23] (03CR) 10Eevans: [C:03+1] Deprecate sessionstore-roots [puppet] - 10https://gerrit.wikimedia.org/r/1214022 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [15:27:00] (03PS5) 10JHathaway: ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) [15:27:44] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 4 others: Hardware requirements for WDQS backend migration. 
- https://phabricator.wikimedia.org/T409769#11424180 (10Jclark-ctr) 05Open→03Resolved a:05bking→03Jclark-ctr [15:29:49] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1088.eqiad.wmnet with reason: host reimage [15:30:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T1530) [15:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:31:40] 07Puppet, 06Infrastructure-Foundations, 10netops, 06serviceops: network::constants::mw_appserver_networks is out of date (or named poorly?) - https://phabricator.wikimedia.org/T411508 (10taavi) 03NEW [15:33:20] (03CR) 10Herron: "To get prometheus metrics from processes that run intermittently I'd recommend using the prometheus pushgateway https://wikitech.wikimedia" [puppet] - 10https://gerrit.wikimedia.org/r/1207174 (https://phabricator.wikimedia.org/T402613) (owner: 10Awight) [15:34:45] 06SRE, 10Cassandra, 06Data-Persistence: Discovery of Cassandra cluster nodes - https://phabricator.wikimedia.org/T410075#11424220 (10elukey) @Eevans I am reasoning out loud, so no need to be sorry, thanks for the follow ups :) The external service should be basically a sort of software LB for hosts outside k... 
[15:35:00] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:41:24] 07Puppet, 06serviceops: network::constants::mw_appserver_networks is out of date (or named poorly?) - https://phabricator.wikimedia.org/T411508#11424250 (10taavi) [15:41:50] (03CR) 10Herron: [C:03+1] Blackbox/check: strengthen suffix matching regex in generated rules [puppet] - 10https://gerrit.wikimedia.org/r/1208365 (https://phabricator.wikimedia.org/T410745) (owner: 10Tiziano Fogli) [15:45:11] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11424281 (10Jclark-ctr) This drive should be arriving any time today can i swap as soon as it arrives? [15:45:20] (03CR) 10Muehlenhoff: [C:03+2] Deprecate sessionstore-roots [puppet] - 10https://gerrit.wikimedia.org/r/1214022 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [15:45:55] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1088.eqiad.wmnet with OS bullseye [15:45:59] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11424289 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1088.eqiad.wmnet with... 
[15:46:01] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11424288 (10MoritzMuehlenhoff) [15:46:31] 10ops-codfw, 06DC-Ops: BIOS upgrade for backup2013 & backup2014 - https://phabricator.wikimedia.org/T411511 (10jcrespo) 03NEW [15:47:22] (03CR) 10Clément Goubert: "Needs a chart version bump, otherwise LGTM, and is IMO preferrable to I7574791c5aa2257af481d103df6002960a239ba0" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207267 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz) [15:47:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:48:25] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11424308 (10MoritzMuehlenhoff) Yes, please. I'll take care of the software RAID rebuild. [15:48:28] 06SRE, 10Cassandra, 06Data-Persistence: Discovery of Cassandra cluster nodes - https://phabricator.wikimedia.org/T410075#11424309 (10Eevans) >>! In T410075#11424220, @elukey wrote: > [ ... ] The external service should be basically a sort of software LB for hosts outside k8s, so you can contact them from k8s... [15:48:57] !log upgrade Envoy on Yarn T405808 [15:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:00] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [15:49:49] 07Puppet, 06serviceops: network::constants::mw_appserver_networks is out of date (or named poorly?) - https://phabricator.wikimedia.org/T411508#11424318 (10Clement_Goubert) p:05Triage→03Low Probably a good task to pair on with @Blake or @jasmine_ on tracking down what's used where in puppet. 
[15:50:34] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1251 gradually with 4 steps - Pool db1251.eqiad.wmnet in after cloning [15:53:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:54:18] (03PS1) 10Btullis: Update the spark configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214091 (https://phabricator.wikimedia.org/T410017) [15:55:42] (03PS2) 10Btullis: Update the spark configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214091 (https://phabricator.wikimedia.org/T410017) [15:57:27] (03PS1) 10Muehlenhoff: Drop use of MW_APPSERVER_NETWORKS for ircstream now that mw* servers are gone [puppet] - 10https://gerrit.wikimedia.org/r/1214094 (https://phabricator.wikimedia.org/T411508) [15:57:39] (03PS2) 10Muehlenhoff: Drop use of MW_APPSERVER_NETWORKS for ircstream now that mw* servers are gone [puppet] - 10https://gerrit.wikimedia.org/r/1214094 (https://phabricator.wikimedia.org/T411508) [15:57:50] (03CR) 10Cathal Mooney: [C:03+1] sre.hosts.provision: make UEFI default [cookbooks] - 10https://gerrit.wikimedia.org/r/1214055 (owner: 10Elukey) [15:58:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:58:22] !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on sretest1005.eqiad.wmnet with reason: ipxe [15:59:09] (03CR) 10Clément Goubert: [C:03+1] 
Drop use of MW_APPSERVER_NETWORKS for ircstream now that mw* servers are gone [puppet] - 10https://gerrit.wikimedia.org/r/1214094 (https://phabricator.wikimedia.org/T411508) (owner: 10Muehlenhoff) [16:00:04] jelto, arnoldokoth, mutante, and arnaudb: May I have your attention please! SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T1600) [16:00:24] !log restarting wdqs@codfw - system overloaded [16:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:26] !log jhathaway@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:02:36] !log installing libsndfile security updates [16:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:02] RESOLVED: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:03:19] FYI, I'll be making a change shortly that will require briefly taking the scap lock to block MediaWiki deployments [16:03:32] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:03:53] (03CR) 10Brouberol: [C:03+1] Update the spark configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214091 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [16:04:26] (03CR) 10Scott French: 
"Thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1213600 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [16:04:33] (03CR) 10Scott French: [C:03+2] hieradata: enable cfssl/pki for etcd on conf1008 [puppet] - 10https://gerrit.wikimedia.org/r/1213600 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [16:05:48] (03CR) 10Btullis: [C:03+1] opensearch on k8s: add DC-specific records for opensearch-ipoid [dns] - 10https://gerrit.wikimedia.org/r/1213580 (https://phabricator.wikimedia.org/T410956) (owner: 10Bking) [16:06:59] (03PS1) 10Majavah: network: Remove unused cloud_nova_hosts_ranges variable [puppet] - 10https://gerrit.wikimedia.org/r/1214099 [16:08:24] !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:08:26] (03PS1) 10MVernon: swift: restore ms-be1088 to the rings [puppet] - 10https://gerrit.wikimedia.org/r/1214100 (https://phabricator.wikimedia.org/T404356) [16:08:32] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:08:41] !log migrating etcd to PKI certs on conf1008 - T352245 [16:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:44] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [16:08:50] !log swfrench@deploy2002 Locking from deployment [MediaWiki]: Hold deployments during etcd certificate change - T352245 [16:10:46] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [16:12:07] (03PS1) 
10Kosta Harlan: Refactor: Move editing session ID logic into service [extensions/WikimediaEvents] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214101 (https://phabricator.wikimedia.org/T406865) [16:12:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11424434 (10BCornwall) @cmooney Yes, that looks good to me. We can still go for Dec 3 - fee... [16:12:14] (03PS1) 10Kosta Harlan: hCaptcha: Log diff when challenge is presented [extensions/WikimediaEvents] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214102 (https://phabricator.wikimedia.org/T406865) [16:12:19] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [16:12:35] !log swfrench@deploy2002 Unlocked for deployment [MediaWiki]: Hold deployments during etcd certificate change - T352245 (duration: 03m 45s) [16:12:58] !log installing nodejs security updates [16:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:32] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:13:47] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:13:54] 10ops-eqiad, 06SRE, 06DC-Ops, 
06Infrastructure-Foundations, and 3 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11424463 (10BCornwall) @cmooney Yes, that looks good to me. We can still go for Dec 4 - fee... [16:14:22] (03CR) 10Btullis: [C:03+2] Update the spark configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214091 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [16:14:23] !log begin rolling restarts of eqiad-associated confds - T352245 [16:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:27] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [16:14:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:15:49] 06SRE, 06Infrastructure-Foundations: Broadcom Nic not supporting uefi with older firmware - https://phabricator.wikimedia.org/T411374#11424468 (10Jclark-ctr) [16:16:05] (03Merged) 10jenkins-bot: Update the spark configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214091 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [16:16:20] (03CR) 10SBassett: "Hey @ssingh@wikimedia.org - just wanted to check in and see if a deployment time had been scheduled for this yet. Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [16:18:15] !log restarted navtiming on webperf1003 - T352245 [16:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:48] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [16:19:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:20:30] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [16:22:56] (03Abandoned) 10JHathaway: WIP: iPXE MBR [cookbooks] - 10https://gerrit.wikimedia.org/r/1211169 (owner: 10JHathaway) [16:23:32] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:23:37] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [16:24:13] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [16:26:51] jouncebot: nowandnext [16:26:51] For the next 0 hour(s) and 33 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T1600) [16:26:52] In 0 hour(s) and 33 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T1700) [16:27:26] !log import varnish 
7.1.1-2~bpo13+wmf2 into trixie-wikimedia - T401832 [16:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:29] T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832 [16:27:59] I'd like to backport https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/1214069 and the associated core patch to wmf.4, any objection to that? [16:28:31] ihurbain: It's already merged. Did you merge to the prod branch without deploying? [16:28:48] ihurbain: Any production deploy since merging would have deployed it immediately. [16:29:23] James_F: *oops* (apparently we fucked up on procedure with cscott) [16:29:27] It looks like https://sal.toolforge.org/log/FCqI35oB8tZ8Ohr0YWqP was the last deploy of MW-land, so this time it won't have broken the world. [16:29:33] so i guess "yes i'm deploying" [16:29:42] But yeah, next time don't merge to wmf.* until you have the deploy conch, please. :-) [16:29:55] nod nod, apologies and duly noted [16:29:59] jouncebot: nowandnext [16:29:59] For the next 0 hour(s) and 30 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T1600) [16:30:00] In 0 hour(s) and 30 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T1700) [16:30:10] I think SRE Collab aren't using the window right now? 
[16:30:25] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [16:31:03] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Remove unused php_version variable [puppet] - 10https://gerrit.wikimedia.org/r/1214109 (https://phabricator.wikimedia.org/T396166) [16:31:35] (03CR) 10Jcrespo: [C:03+1] swift: load/drain 5 codfw backends (h/w refresh) [puppet] - 10https://gerrit.wikimedia.org/r/1214077 (https://phabricator.wikimedia.org/T404771) (owner: 10MVernon) [16:32:16] (03PS2) 10Ahmon Dancy: scap.cfg.erb: Remove unused php_version variable [puppet] - 10https://gerrit.wikimedia.org/r/1214109 (https://phabricator.wikimedia.org/T396166) [16:32:30] (03CR) 10Jcrespo: [C:03+1] swift: restore ms-be1088 to the rings [puppet] - 10https://gerrit.wikimedia.org/r/1214100 (https://phabricator.wikimedia.org/T404356) (owner: 10MVernon) [16:33:26] spiderpig seems to say that nothing is running right now [16:33:42] (03CR) 10MVernon: [C:03+2] swift: load/drain 5 codfw backends (h/w refresh) [puppet] - 10https://gerrit.wikimedia.org/r/1214077 (https://phabricator.wikimedia.org/T404771) (owner: 10MVernon) [16:33:54] (03CR) 10MVernon: [C:03+2] swift: restore ms-be1088 to the rings [puppet] - 10https://gerrit.wikimedia.org/r/1214100 (https://phabricator.wikimedia.org/T404356) (owner: 10MVernon) [16:33:57] Yes, they do puppet commits. [16:34:03] `jq 'select(keys | length > 0) | ({"lock": input_filename} + .)' /var/lock/scap*` agrees that no locks are held atm fwiw [16:34:08] (doesn’t cover puppet ofc) [16:34:14] ihurbain: Go for it. 
[16:34:18] I think you're good to deploy ihurbain [16:34:18] let's goooo [16:34:20] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1214109 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy) [16:35:25] !log ihurbain@deploy2002 Started scap sync-world: Backport for [[gerrit:1214069|Bump parsoid to v0.23.0-a7.1 on wmf.4 (T411238 T410960)]], [[gerrit:1214070|Bump parsoid to v0.23.0-a7.1 on wmf.4 (T411238 T410960)]] [16:35:29] T411238: Unexpected wikitext changes & whitespace removals by VisualEditor edits - https://phabricator.wikimedia.org/T411238 [16:35:30] T410960: CTT tasks week of 2025-11-21 - https://phabricator.wikimedia.org/T410960 [16:36:05] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1251 gradually with 4 steps - Pool db1251.eqiad.wmnet in after cloning [16:36:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T410589)', diff saved to https://phabricator.wikimedia.org/P86319 and previous config saved to /var/cache/conftool/dbconfig/20251202-163612-ladsgroup.json [16:36:19] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [16:37:32] !log ihurbain@deploy2002 ihurbain: Backport for [[gerrit:1214069|Bump parsoid to v0.23.0-a7.1 on wmf.4 (T411238 T410960)]], [[gerrit:1214070|Bump parsoid to v0.23.0-a7.1 on wmf.4 (T411238 T410960)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:38:11] jhathaway@cumin1003 reimage (PID 407378) is awaiting input [16:38:30] (03PS1) 10Muehlenhoff: Remove mediawiki-testers group [puppet] - 10https://gerrit.wikimedia.org/r/1214110 [16:38:32] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:38:39] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [16:38:45] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [16:39:23] !log jhathaway@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1005.eqiad.wmnet with OS bookworm [16:39:30] !log ihurbain@deploy2002 ihurbain: Continuing with sync [16:40:00] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [16:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:42:51] (03PS6) 10JHathaway: ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) [16:43:08] !log bking@wmf3062 restart WDQS codfw to resolve lag/possible deadlocks [16:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:16] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host 
sretest1005.eqiad.wmnet with OS bookworm [16:43:32] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:43:36] (03PS15) 10Matthieulec: sre.k8s.pool-depool-node: Adding a --rack flag for more intuitive operations, and more validations to avoid mistakes [cookbooks] - 10https://gerrit.wikimedia.org/r/1212089 (https://phabricator.wikimedia.org/T410537) [16:44:46] !log ihurbain@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214069|Bump parsoid to v0.23.0-a7.1 on wmf.4 (T411238 T410960)]], [[gerrit:1214070|Bump parsoid to v0.23.0-a7.1 on wmf.4 (T411238 T410960)]] (duration: 09m 21s) [16:44:51] T411238: Unexpected wikitext changes & whitespace removals by VisualEditor edits - https://phabricator.wikimedia.org/T411238 [16:44:52] T410960: CTT tasks week of 2025-11-21 - https://phabricator.wikimedia.org/T410960 [16:46:25] there, done. [16:46:32] (03CR) 10Clément Goubert: [C:03+1] Remove mediawiki-testers group [puppet] - 10https://gerrit.wikimedia.org/r/1214110 (owner: 10Muehlenhoff) [16:47:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [16:48:22] FYI, I'll be applying some PyBal config changes on LVS hosts in eqiad. there may be some transient alerts that fire as a result (e.g., `PyBal connections to etcd`), which should be suppressed by downtimes, but I'll call them out here in case anything sneaks through. [16:49:22] (03CR) 10Scott French: "Thank you both!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1213601 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [16:49:58] (03CR) 10Scott French: [C:03+2] hieradata: temporarily point eqiad LVS at conf1008 [puppet] - 10https://gerrit.wikimedia.org/r/1213601 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [16:51:05] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [16:51:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P86320 and previous config saved to /var/cache/conftool/dbconfig/20251202-165119-ladsgroup.json [16:51:30] (03PS16) 10Matthieulec: sre.k8s.pool-depool-node: Adding a --rack flag for more intuitive operations, and more validations to avoid mistakes [cookbooks] - 10https://gerrit.wikimedia.org/r/1212089 (https://phabricator.wikimedia.org/T410537) [16:51:39] (03CR) 10CI reject: [V:04-1] ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway) [16:52:28] 10SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11424707 (10herron) [16:52:51] (03CR) 10Matthieulec: "Thanks, I fixed the nits, results of test-cookbook are in comment of the bug" [cookbooks] - 10https://gerrit.wikimedia.org/r/1212089 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec) [16:53:02] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad (T352245) [16:53:06] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [16:53:15] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:53:39] !log swfrench@cumin2002 END (PASS) - 
Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad (T352245) [16:54:08] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [16:54:10] (03CR) 10Clément Goubert: [C:03+1] sre.k8s.pool-depool-node: Adding a --rack flag for more intuitive operations, and more validations to avoid mistakes [cookbooks] - 10https://gerrit.wikimedia.org/r/1212089 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec) [16:54:11] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:55:00] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [16:55:01] marostegui@cumin1003 clone (PID 371956) is awaiting input [16:57:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86321 and previous config saved to /var/cache/conftool/dbconfig/20251202-165702-marostegui.json [16:57:07] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [16:57:08] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [16:58:42] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad (T352245) [16:58:45] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [16:59:05] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11424827 
(10Papaul) @ayounsi @cmooney thanks for the feedback. [16:59:10] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad (T352245) [16:59:23] 06SRE, 06Infrastructure-Foundations: Broadcom Nic not supporting uefi with older firmware - https://phabricator.wikimedia.org/T411374#11424829 (10Jclark-ctr) [17:00:00] FIRING: [7x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:00:05] jhathaway and rzl: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T1700). Please do the needful. [17:00:05] Pppery: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:34] (03CR) 10Kamila Součková: [C:03+2] admin: update ssh key for kamila [puppet] - 10https://gerrit.wikimedia.org/r/1213540 (https://phabricator.wikimedia.org/T411404) (owner: 10Kamila Součková) [17:01:56] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic2-eqiad (T352245) [17:02:15] Pppery: hi, around? [17:02:28] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic2-eqiad (T352245) [17:02:33] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11424870 (10Papaul) @ssingh We are planning on doing the first phase(loopback IP change on core routers and management router) of the ULSFO refresh next week D... 
[17:03:45] !log import varnish-modules 0.20.0-2~deb13+wmf1 into trixie-wikimedia - T401832 [17:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:48] T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832 [17:05:37] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic1-eqiad (T352245) [17:05:40] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [17:06:09] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic1-eqiad (T352245) [17:06:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P86322 and previous config saved to /var/cache/conftool/dbconfig/20251202-170627-ladsgroup.json [17:07:10] (03PS1) 10Clément Goubert: wikikube-staging: Bum calico memory requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214125 [17:07:17] (03PS3) 10DDesouza: Deploy 2025 Global Readers Survey (non-enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213123 (https://phabricator.wikimedia.org/T410918) [17:07:20] (03PS2) 10Clément Goubert: wikikube-staging: Bump calico memory requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214125 [17:09:54] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1001-dev.eqiad.wmnet with OS trixie [17:10:00] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:10:20] (03PS2) 10Scott French: hieradata: enable cfssl/pki for etcd on all configcluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1213602 
(https://phabricator.wikimedia.org/T352245) [17:10:20] (03PS2) 10Scott French: hieradata: point eqiad LVS back to conf1007 [puppet] - 10https://gerrit.wikimedia.org/r/1213603 (https://phabricator.wikimedia.org/T352245) [17:10:36] !log jhathaway@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1005.eqiad.wmnet with OS bookworm [17:10:37] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1213602 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [17:11:11] (03PS1) 10DDesouza: Undeploy experiment for 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214126 (https://phabricator.wikimedia.org/T410696) [17:12:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P86323 and previous config saved to /var/cache/conftool/dbconfig/20251202-171210-marostegui.json [17:13:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213123 (https://phabricator.wikimedia.org/T410918) (owner: 10DDesouza) [17:13:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214126 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [17:15:00] (03PS2) 10DDesouza: [beta] Undeploy experiment for 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214126 (https://phabricator.wikimedia.org/T410696) [17:15:47] rzl: Sorry, I completely lost track of time. Is it too late to deploy my puppet patch now? [17:17:22] Pppery: not too late! 
but we don't typically put apache config changes into the puppet window, in part because they're in the puppet repo but they're deployed with mediawiki :) [17:17:42] happy to review with you though [17:17:48] OK [17:18:03] Way back when you deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1079056 via the puppet window [17:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:21:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T410589)', diff saved to https://phabricator.wikimedia.org/P86324 and previous config saved to /var/cache/conftool/dbconfig/20251202-172134-ladsgroup.json [17:21:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ssw1-d8-eqiad cross-rack links incorrect in Netbox - https://phabricator.wikimedia.org/T411480#11425017 (10VRiley-WMF) 05Open→03Resolved Updated cable paths for the new switches in D8 to E1 and F1 [17:21:38] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [17:21:45] I did yeah :) sometimes I will, because I like to be helpful and I can do it even though it's not technically supposed to be part of the window, but other SREs may not be able to [17:21:52] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance [17:22:47] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1001-dev.eqiad.wmnet with reason: host reimage [17:23:09] Thus I ended up here by analogy with my experience from that patch (since it's almost exactly the same type of patch - changing where URLs for uncreated projects point 
to via apache) [17:23:18] nod [17:23:34] RECOVERY - MD RAID on bast2003 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:23:40] there isn't a great process here and I understand sometimes the puppet window is the only way to actually get it in front of someone [17:24:46] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [17:24:49] but in this case since any apache config change is a specialized subject with a high blast radius, and since there won't actually be anything for you to test, it just doesn't make sense [17:25:00] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:25:11] I'm working on proposing a rework, I'll make sure to consider patches like this [17:25:59] I would say my patch is the antithesis of that; there is something to test (do the URLs I described in the commit message go to where I say they should) and there's not that high a blast radius as it only affects two specific domains.
[17:27:10] my point is your change doesn't go live when we merge this to puppet -- it goes live when we subsequently do a helmfile deploy [17:27:16] oh [17:27:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P86325 and previous config saved to /var/cache/conftool/dbconfig/20251202-172717-marostegui.json [17:28:11] I'm comfortable doing that deploy but most SREs won't, and the fact that I happen to also have my name on the puppet window is a coincidence :) [17:29:10] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1001-dev.eqiad.wmnet with reason: host reimage [17:29:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11425063 (10VRiley-WMF) 05Open→03In progress removing and updating cables [17:29:31] so, I don't remember exactly but when I did this last time I probably said something like, "I'm not technically supposed to do this, and we can't do it every time, but I'll do it this time anyway, here goes" [17:30:04] I've sometimes been told I shouldn't do stuff like that because it makes people expect it'll always just work! 
so that's why I'm trying to be a little more careful [17:30:40] I don't want to block you here either, but, again just for clarity, officially this Does Not Belong in the puppet window because it's more complex [17:31:13] "complex" is a subjective term - from my perspective it is a simple change - although I see why from your perspective it isn't [17:31:35] (and of course you probably know, many config changes that were only supposed to have small effects have turned out to knock the site over) [17:31:50] Yeah, I know [17:32:25] this is a blanket statement about apache especially -- *any* change in there will be met with caution and scrutiy [17:32:27] *scrutiny [17:32:31] (03CR) 10JHathaway: UEFI: dup partition on MD RAID boxes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [17:33:20] https://wikitech.wikimedia.org/w/index.php?title=Puppet_request_window&diff=prev&oldid=2367508 [17:33:41] (03CR) 10Hnowlan: [C:03+1] sre.k8s.pool-depool-node: Adding a --rack flag for more intuitive operations, and more validations to avoid mistakes [cookbooks] - 10https://gerrit.wikimedia.org/r/1212089 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec) [17:34:11] I was just typing -- if you want to add an httpbb test for this change (end of that line that you're looking at) it's a good way to make this easier [17:34:41] for example makes it easy to verify that those two deletions have two different effects [17:35:00] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:37:13] (03CR) 10Matthieulec: [C:03+1] sre.k8s.pool-depool-node: Adding a --rack flag for more intuitive operations, and more validations to avoid mistakes [cookbooks] - 
10https://gerrit.wikimedia.org/r/1212089 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec) [17:37:25] anyway I'm sorry this is a pain, and like I say I'm working on coming up with a new process which will replace the puppet request window and hopefully be easier for everyone [17:38:33] I'm writing tests now, although of course they will start failing if langcom approves a nb.wikiversity or nb.wikivoyage [17:40:01] (03CR) 10Hnowlan: [C:03+2] "+2ing for Matthieu as he's not in `ops`" [cookbooks] - 10https://gerrit.wikimedia.org/r/1212089 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec) [17:40:39] nod, that's fine -- you're right, it'll surprise us when that happens, but we can track it down -- and in the meantime we know this is a change we want to test, so on balance it's good [17:40:59] (03PS6) 10Pppery: Remove bad Norwegian funnels [puppet] - 10https://gerrit.wikimedia.org/r/1208442 (https://phabricator.wikimedia.org/T407553) [17:41:07] Hope I did that right [17:41:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11425144 (10VRiley-WMF) 05In progress→03Resolved a:03VRiley-WMF this is completed [17:42:02] (03CR) 10Hnowlan: "+1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207267 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz) [17:42:17] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11425151 (10MLechvien-WMF) 05Open→03Resolved [17:42:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86326 and previous config saved to /var/cache/conftool/dbconfig/20251202-174225-marostegui.json [17:42:31] T411163: Drop ar_sha1 from archive table in wmf production - 
https://phabricator.wikimedia.org/T411163 [17:42:32] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [17:42:34] cool, thank you - looking, just a moment [17:42:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [17:42:45] (03PS7) 10Pppery: Remove bad Norwegian funnels [puppet] - 10https://gerrit.wikimedia.org/r/1208442 (https://phabricator.wikimedia.org/T407553) [17:42:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86327 and previous config saved to /var/cache/conftool/dbconfig/20251202-174249-marostegui.json [17:43:05] !log jhathaway@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1005.eqiad.wmnet with reason: host reimage [17:44:44] (03PS1) 10AOkoth: vrts: re-enable cache cleanup timer [puppet] - 10https://gerrit.wikimedia.org/r/1214129 (https://phabricator.wikimedia.org/T411452) [17:44:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1251.eqiad.wmnet onto db1169.eqiad.wmnet [17:46:49] Pppery: those tests are both for nb.wikivoyage.org, otherwise LGTM [17:47:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1212.eqiad.wmnet with reason: Maintenance [17:47:10] Whoops, meant the second one to be for nb.wikiversity.org [17:47:13] Fixed that now [17:47:14] swfrench-wmf: are you using the infra window today? 
[17:47:17] (03PS8) 10Pppery: Remove bad Norwegian funnels [puppet] - 10https://gerrit.wikimedia.org/r/1208442 (https://phabricator.wikimedia.org/T407553) [17:47:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: Maintenance [17:47:30] (03Merged) 10jenkins-bot: sre.k8s.pool-depool-node: Adding a --rack flag for more intuitive operations, and more validations to avoid mistakes [cookbooks] - 10https://gerrit.wikimedia.org/r/1212089 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec) [17:47:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1212 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86328 and previous config saved to /var/cache/conftool/dbconfig/20251202-174732-marostegui.json [17:47:37] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [17:47:38] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [17:48:27] rzl: there's a ~ 15 minute span where I'll need to block mediawiki deployments during some etcd maintenance work, but we can easily coordinate around that. [17:48:40] !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1005.eqiad.wmnet with reason: host reimage [17:49:07] okay cool, thank you <3 I might be able to get this wrapped up before the top of the hour, but if I run a few minutes over, is that okay? or would you rather I pause for your go-ahead? 
[17:49:30] (03CR) 10RLazarus: [C:03+2] Remove bad Norwegian funnels [puppet] - 10https://gerrit.wikimedia.org/r/1208442 (https://phabricator.wikimedia.org/T407553) (owner: 10Pppery) [17:50:00] rzl: no need to rush - I have setup work to complete first, so it's totally fine if your deployment runs into the window :) [17:50:09] rad, thank you, I'll let you know when I'm clear [17:50:15] * swfrench-wmf thumbs up [17:50:17] 10ops-eqiad, 06DC-Ops: Reclaim components from decommed servers - https://phabricator.wikimedia.org/T411533 (10VRiley-WMF) 03NEW [17:50:27] (03PS1) 10Btullis: Correct an error in the selector for external-services in analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214130 (https://phabricator.wikimedia.org/T406833) [17:50:35] (puppet merging now -- httpbb failures are expected until the helmfile deploy is done) [17:51:45] haha I forgot about puppet on the deploy host 🙃 so much for the top of the hour [17:52:01] lol [17:55:11] (03CR) 10Ssingh: [C:03+1] hieradata: point eqiad LVS back to conf1007 [puppet] - 10https://gerrit.wikimedia.org/r/1213603 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [17:57:04] (03CR) 10JHathaway: [C:03+1] P:postfix::mx: Convert port to an integer [puppet] - 10https://gerrit.wikimedia.org/r/1212139 (owner: 10Majavah) [17:57:27] (03PS2) 10Majavah: P:postfix::mx: Convert port to an integer [puppet] - 10https://gerrit.wikimedia.org/r/1212139 [17:59:27] (03CR) 10Majavah: [C:03+2] P:postfix::mx: Convert port to an integer [puppet] - 10https://gerrit.wikimedia.org/r/1212139 (owner: 10Majavah) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T1800) [18:00:27] helmfile diffs look good [18:00:29] !log rzl@deploy2002 Started scap sync-world: https://gerrit.wikimedia.org/r/1208442 T407553 [18:00:31] T407553: nb.wikiversity.org redirects to 404 page on BetaWikiversity - https://phabricator.wikimedia.org/T407553 
[18:01:28] !log silenced EtcdReplicationDown (42a82757-2075-44fd-b057-ec9ed2afeb90) - T352245 [18:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:31] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [18:01:39] !log rzl@deploy2002 rzl: https://gerrit.wikimedia.org/r/1208442 T407553 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:02:05] Pppery: go ahead and test with the debug servers, but the nice thing is your httpbb tests just passed :) [18:02:23] Works like I expected [18:02:26] 👍 [18:02:28] !log rzl@deploy2002 rzl: Continuing with sync [18:04:10] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11425341 (10SEgt-WMF) In case it is useful: the MediaWiki page @Rmaung pointed out was the first thing that came up for me when I googled "Request Access... [18:04:28] !log manually transferred codfw etcd replication source to conf1008 - T352245 [18:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:10] !log rzl@deploy2002 Finished scap sync-world: https://gerrit.wikimedia.org/r/1208442 T407553 (duration: 06m 36s) [18:06:13] T407553: nb.wikiversity.org redirects to 404 page on BetaWikiversity - https://phabricator.wikimedia.org/T407553 [18:07:10] Thank you for dealing with my patch despite me technically going through the wrong process. Since I only seem to do this once a year this probably won't come up again. [18:07:20] all set! 
thanks Pppery for your patience, I promise we're working on something better here [18:07:41] it isn't your fault -- the problem is we don't have a right process, only a few different wrong ones :( [18:07:44] (03CR) 10Btullis: [C:03+2] Correct an error in the selector for external-services in analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214130 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [18:08:02] swfrench-wmf: I'm hands-off, it's all yours -- thanks for your patience too [18:08:02] (03CR) 10Scott French: [C:03+2] hieradata: enable cfssl/pki for etcd on all configcluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1213602 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [18:08:18] thanks, rzl! no worries at all :) [18:08:45] !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1005.eqiad.wmnet with OS bookworm [18:09:22] (03PS1) 10Aleksandar Mastilovic: Override the from: email address coming from Airflow dev instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214131 (https://phabricator.wikimedia.org/T411536) [18:09:26] (03Merged) 10jenkins-bot: Correct an error in the selector for external-services in analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214130 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [18:10:22] !log swfrench@deploy2002 Locking from deployment [MediaWiki]: Hold deployments during etcd certificate change - T352245 [18:10:25] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [18:10:46] RECOVERY - MD RAID on ms-fe2014 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:12:18] !log migrating etcd to PKI certs on conf1009 - T352245 [18:12:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:04] (03CR) 10Btullis: [C:03+1] Override the from: email address coming from Airflow dev instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214131 (https://phabricator.wikimedia.org/T411536) (owner: 10Aleksandar Mastilovic) [18:15:18] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [18:15:25] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [18:16:34] !log manually transferred etcd replication source back to conf1009 - T352245 [18:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:41] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [18:19:46] !log deleted EtcdReplicationDown silence (42a82757-2075-44fd-b057-ec9ed2afeb90) - T352245 [18:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1191 - https://phabricator.wikimedia.org/T411209#11425379 (10BTullis) OK, thanks. You can go ahead and swap these. I found out which drivers were showing errors, as far as the kernel is concerned: ` sudo... [18:20:04] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on bast2003 - https://phabricator.wikimedia.org/T410195#11425382 (10MoritzMuehlenhoff) 05Open→03Resolved p:05Triage→03Medium RAID is rebuilt, resolving. 
[18:22:20] !log migrating etcd to PKI certs on conf1007 - T352245 [18:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:23] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [18:23:49] !log begin rolling restarts of eqiad-associated confds - T352245 [18:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:58] (03CR) 10JavierMonton: [C:03+1] Override the from: email address coming from Airflow dev instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214131 (https://phabricator.wikimedia.org/T411536) (owner: 10Aleksandar Mastilovic) [18:24:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [18:24:16] 10ops-eqiad, 06SRE, 06DC-Ops: Audit Eqiad Patch panels for variance from Netbox - https://phabricator.wikimedia.org/T408197#11425408 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [18:26:51] !log restarted navtiming on webperf1003 - T352245 [18:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:57] !log swfrench@deploy2002 Unlocked for deployment [MediaWiki]: Hold deployments during etcd certificate change - T352245 (duration: 17m 35s) [18:28:00] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [18:29:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [18:32:25] (03CR) 10Scott French: "Thank you both!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1213603 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [18:32:32] (03CR) 10Scott French: [C:03+2] hieradata: point eqiad LVS back to conf1007 [puppet] - 10https://gerrit.wikimedia.org/r/1213603 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [18:32:37] !log repool ms-fe2014 T410959 [18:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:40] T410959: Degraded RAID on ms-fe2014 - https://phabricator.wikimedia.org/T410959 [18:33:24] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on ms-fe2014 - https://phabricator.wikimedia.org/T410959#11425448 (10MatthewVernon) @Jhancock.wm RAID rebuilt OK, server back in production. Thanks for your help here :) [18:36:15] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad (T352245) [18:36:18] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [18:36:39] (03PS1) 10Btullis: Correct the external-services definition for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214135 (https://phabricator.wikimedia.org/T406833) [18:36:54] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad (T352245) [18:39:12] (03CR) 10Btullis: [C:03+2] Correct the external-services definition for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214135 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [18:40:11] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad (T352245) [18:40:51] (03Merged) 10jenkins-bot: Correct the external-services definition for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214135 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [18:41:02] !log 
swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad (T352245) [18:41:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [18:41:19] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [18:41:26] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [18:41:56] 07Puppet, 06cloud-services-team, 10Horizon: Allow providing a commit message for hieradata changes - https://phabricator.wikimedia.org/T250623#11425504 (10taavi) p:05Triage→03Low [18:43:09] (03CR) 10Brouberol: [C:04-1] "Nit: the email address should be between `<>`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214131 (https://phabricator.wikimedia.org/T411536) (owner: 10Aleksandar Mastilovic) [18:46:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [18:46:35] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad: cleanup Interface enabled but not connected alert - https://phabricator.wikimedia.org/T411194#11425541 (10Jclark-ctr) [18:46:51] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11425543 (10ssingh) >>! In T408892#11424869, @Papaul wrote: > @ssingh We are planning on doing the first phase(loopback IP change on core routers and managemen... 
[18:47:33] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic2-eqiad (T352245) [18:47:36] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [18:47:56] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic2-eqiad (T352245) [18:48:05] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: eqiad: cleanup Interface enabled but not connected alert - https://phabricator.wikimedia.org/T411194#11425549 (10Jclark-ctr) a:03Jhancock.wm @Jhancock.wm I have cleaned up old switches in eqiad. The remaining are in codfw [18:52:25] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1001-dev.eqiad.wmnet with OS trixie [18:53:29] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic1-eqiad (T352245) [18:53:33] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [18:53:58] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic1-eqiad (T352245) [18:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:58:40] (03PS5) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 [18:59:19] (03PS1) 10Btullis: Use the correct calico selector syntax for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214141 (https://phabricator.wikimedia.org/T406833) [18:59:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Reimage failed after prompt...is prompt needed? 
- https://phabricator.wikimedia.org/T406656#11425595 (10bking) 05Open→03Resolved a:03bking Thanks again for your patience. I've added a mini-essay on why I think it's safe to... [19:02:13] (03CR) 10Btullis: [C:03+2] Use the correct calico selector syntax for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214141 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [19:03:37] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs3010*} and A:liberica [19:03:49] (03Merged) 10jenkins-bot: Use the correct calico selector syntax for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214141 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [19:05:43] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 (owner: 10CDobbins) [19:07:41] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs3010*} and A:liberica [19:12:40] (03PS6) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 [19:15:13] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [19:15:20] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [19:15:47] (03CR) 10Scott French: [C:03+1] conf/eqiad: Remove obsolete cert [puppet] - 10https://gerrit.wikimedia.org/r/1182694 (https://phabricator.wikimedia.org/T352245) (owner: 10Muehlenhoff) [19:16:26] (03PS2) 10Aleksandar Mastilovic: Override the from: email address coming from Airflow dev instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214131 (https://phabricator.wikimedia.org/T411536) [19:16:33] (03CR) 10Aleksandar Mastilovic: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214131 (https://phabricator.wikimedia.org/T411536) (owner: 10Aleksandar Mastilovic) [19:17:50] 
(03CR) 10Btullis: [C:03+1] Override the from: email address coming from Airflow dev instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214131 (https://phabricator.wikimedia.org/T411536) (owner: 10Aleksandar Mastilovic) [19:18:00] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543 (10medelius) 03NEW [19:19:08] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 (owner: 10CDobbins) [19:29:36] (03PS1) 10Aaron Schulz: Update Math API title and project-specific /math/ endpoint stability policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214142 (https://phabricator.wikimedia.org/T411517) [19:29:37] (03PS1) 10Aaron Schulz: Remove /data-parsoid/ endpoint per T393557 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214143 [19:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:30:54] (03PS2) 10Aaron Schulz: Update Math API title and project-specific /math/ endpoint stability policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214142 (https://phabricator.wikimedia.org/T411517) [19:31:13] jouncebot: now [19:31:13] No deployments scheduled for the next 1 hour(s) and 28 minute(s) [19:33:17] FIRING: [2x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:34:51] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [19:37:43] 
FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [19:38:17] FIRING: [16x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:42:37] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:42:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [19:43:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:43:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:44:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [19:45:00] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - 
https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [19:46:27] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added gerrit-lb.codfw.wikimedia.org T365259 - dzahn@cumin2002" [19:48:02] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:48:51] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added gerrit-lb.codfw.wikimedia.org T365259 - dzahn@cumin2002" [19:48:51] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:49:42] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:49:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [19:50:00] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:52:24] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:53:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:53:17] FIRING: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:53:52] !log jhathaway@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1005.eqiad.wmnet with reason: host reimage [19:55:00] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [19:55:00] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:56:55] (03CR) 10Bking: [C:03+2] ingress: remove reference to defunct template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196174 (https://phabricator.wikimedia.org/T406876) (owner: 10Bking) [19:57:44] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:58:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:58:38] !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1005.eqiad.wmnet with reason: host reimage [20:00:05] FIRING: 
MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [20:01:57] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#11425775 (10Scott_French) Thanks for the heads-up, @Marostegui. From a quick skim of the history here, agreed that it should now be straightf... [20:03:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:05:00] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [20:05:00] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:05:03] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:05:05] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[20:05:22] thanos acting up again I see
[20:05:46] denisse: ^ is this known to olly? we also saw it yesterday so I was wondering if I should file a task
[20:06:03] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:06:15] sukhe: thanos got OOM-killed
[20:06:29] the same happened to titan1001 on Friday and Monday as well
[20:06:37] moritzm: ah. yeah, probably even yesterday night
[20:06:45] I didn't look at that time but it was acting up
[20:06:49] I will flag for olly
[20:06:50] Let me look.
[20:06:57] Yeah, it's known.
[20:06:57] <3
[20:07:01] ok
[20:07:07] Let me find the task.
[20:07:11] Tiziano reverted a patch which reduced the retention time to how it was before
[20:07:35] but probably due to the still increased volume in flight, this still triggered
[20:07:45] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:08:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[20:08:10] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1213487 was the revert
[20:08:17] RESOLVED: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:09:31] My laptop froze, I'll have to reboot. [20:09:43] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS trixie [20:10:00] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [20:10:00] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:10:18] (03PS1) 10Bartosz Dziewoński: CentralAuthUser: Add debugging information for T385310 [extensions/CentralAuth] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214146 (https://phabricator.wikimedia.org/T385310) [20:10:33] (03PS1) 10Bartosz Dziewoński: CentralAuthUser: Add debugging information for T385310 [extensions/CentralAuth] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214147 (https://phabricator.wikimedia.org/T385310) [20:10:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214146 (https://phabricator.wikimedia.org/T385310) (owner: 10Bartosz Dziewoński) [20:10:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214147 (https://phabricator.wikimedia.org/T385310) (owner: 10Bartosz Dziewoński) [20:11:50] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on ms-fe2014 - https://phabricator.wikimedia.org/T410959#11425792 
(Jhancock.wm) Open→Resolved a:Jhancock.wm np!
[20:12:36] I'm back.
[20:12:44] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1050.eqiad.wmnet with OS trixie
[20:12:47] denisse: no worries, it resolved
[20:13:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[20:13:07] * denisse wondering if we need to revert the revert.
[20:13:17] FIRING: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:13:40] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox
[20:15:00] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[20:15:02] (CR) VolkerE: [C:-1] "This looks technically good. Only one small naming amendment request inside." [mediawiki-config] - https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: LorenMora)
[20:15:08] Let me dig deeper into the Thanos issue, we've had several instances of it being OOM killed for various reasons. I need to investigate if it's a query of death, the retention time, etc.
[20:15:39] thank you
[20:16:05] ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q2:rack/setup/install franio2004 - https://phabricator.wikimedia.org/T405981#11425798 (Jhancock.wm) @Dwisehaupt did you get my email about the password?
[20:16:25] FIRING: SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:28] Related: https://phabricator.wikimedia.org/T411343 [20:16:35] (03PS1) 10Krinkle: Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214148 (https://phabricator.wikimedia.org/T400023) [20:16:37] (03PS1) 10Krinkle: robots.txt: Clean up inline comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214149 [20:16:37] (03PS1) 10Krinkle: robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214150 [20:17:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [20:18:17] RESOLVED: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:18:39] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added gerrit-lb VIP for esams and ulsfo T365259 - dzahn@cumin2002" [20:18:44] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:18:45] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added gerrit-lb VIP for esams and ulsfo T365259 - dzahn@cumin2002" [20:18:45] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:18:51] !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1005.eqiad.wmnet with OS bookworm [20:21:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [20:21:57] (03PS7) 10JHathaway: ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) [20:23:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:26:06] jouncebot: nowandnext [20:26:06] No deployments scheduled for the next 0 hour(s) and 33 minute(s) [20:26:07] In 0 hour(s) and 33 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T2100) [20:26:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [20:26:41] !log andrew@cumin2002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: host reimage [20:26:48] I'm going to sync a patch now, if that's alright [20:26:50] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:27:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [20:27:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:28:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214101 (https://phabricator.wikimedia.org/T406865) (owner: 10Kosta Harlan) [20:28:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214102 (https://phabricator.wikimedia.org/T406865) (owner: 10Kosta Harlan) [20:28:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:28:31] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1054.eqiad.wmnet with OS trixie [20:28:44] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit 
statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:29:26] (03Merged) 10jenkins-bot: Refactor: Move editing session ID logic into service [extensions/WikimediaEvents] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214101 (https://phabricator.wikimedia.org/T406865) (owner: 10Kosta Harlan) [20:29:29] (03Merged) 10jenkins-bot: hCaptcha: Log diff when challenge is presented [extensions/WikimediaEvents] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214102 (https://phabricator.wikimedia.org/T406865) (owner: 10Kosta Harlan) [20:29:45] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1050.eqiad.wmnet with reason: host reimage [20:30:02] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1214101|Refactor: Move editing session ID logic into service (T406865)]], [[gerrit:1214102|hCaptcha: Log diff when challenge is presented (T406865)]] [20:30:05] T406865: hCaptcha: Implement mechanism to log about-to-be-published content when challenge is presented - https://phabricator.wikimedia.org/T406865 [20:31:03] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added gerrit-lb VIP for drmrs and eqsin T365259 - dzahn@cumin2002" [20:31:08] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added gerrit-lb VIP for drmrs and eqsin T365259 - dzahn@cumin2002" [20:31:08] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:32:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [20:32:54] RESOLVED: KubernetesAPILatency: High 
Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:33:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:33:21] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:34:19] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: host reimage [20:37:13] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added gerrit-lb VIP for magru and eqiad T365259 - dzahn@cumin2002" [20:37:18] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added gerrit-lb VIP for magru and eqiad T365259 - dzahn@cumin2002" [20:37:19] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:38:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:38:11] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1050.eqiad.wmnet with reason: host reimage [20:40:11] FIRING: [2x] 
ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:41:25] RESOLVED: SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:41:42] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: eqiad: cleanup Interface enabled but not connected alert - https://phabricator.wikimedia.org/T411194#11425892 (10Jhancock.wm) 05Open→03Resolved a:05Jhancock.wm→03Jclark-ctr already had a separate task but thank you! [20:42:55] (03PS8) 10JHathaway: ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) [20:42:55] (03PS1) 10JHathaway: reimage: default to UUID rather than Option 82 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214152 [20:43:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:43:17] FIRING: ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:44:50] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage 
[20:45:26] 06SRE, 10SRE-Access-Requests: Yubikey-SSH-FIDO for Hugh Nowlan (hnowlan) - https://phabricator.wikimedia.org/T411365#11425904 (10andrea.denisse) Hi Hugh, this is on the clinic duty dashboard. Since the patch is merged I was wondering if there's anything else to do or if we should close as resolved. [20:46:44] (03Abandoned) 10C. Scott Ananian: WIP: parsoid update [vendor] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214061 (owner: 10C. Scott Ananian) [20:47:01] 06SRE, 10SRE-Access-Requests: Update SSH key for kamila - https://phabricator.wikimedia.org/T411404#11425906 (10andrea.denisse) Hi Raine, this is on the clinic duty dashboard. Since the patch is merged I was wondering if there's anything else I can assist with or if we should close as resolved. [20:48:02] RESOLVED: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:48:17] RESOLVED: ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:48:17] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11425908 (10andrea.denisse) a:03andrea.denisse [20:48:25] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:48:41] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage [20:48:55] 06SRE, 10SRE-Access-Requests: Requesting update 
of SSH key for zoe - https://phabricator.wikimedia.org/T411506#11425909 (10andrea.denisse) a:03andrea.denisse [20:49:25] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11425910 (10andrea.denisse) a:03andrea.denisse [20:52:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [20:52:06] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added gerrit-lb VIP - IPv6 - for codfw and eqiad T365259 - dzahn@cumin2002" [20:52:12] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added gerrit-lb VIP - IPv6 - for codfw and eqiad T365259 - dzahn@cumin2002" [20:52:12] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:52:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [20:57:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [20:57:17] FIRING: ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:58:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:58:53] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [21:00:01] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1046.eqiad.wmnet with OS trixie [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T2100). [21:00:05] aude, katherine_g, danisztls, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:17] o/ [21:00:24] o/ [21:00:31] my patch has a last minute -1 and we want to enable the config for testwiki, so we will move ours to tomorrow [21:00:36] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1061.eqiad.wmnet with OS trixie [21:01:16] hi [21:01:30] any deployers around? i can't self-deploy [21:02:04] I can deploy [21:02:17] RESOLVED: ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:02:29] I'll deploy mine first unless they can all go together? 
[21:03:20] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: LorenMora)
[21:03:24] Go ahead katherine_g , I can do MatmaRex's patches when you're done (and danisztls's too if he needs me to)
[21:03:30] (PS2) Aaron Schulz: Remove /data-parsoid/ endpoint per T393557 [mediawiki-config] - https://gerrit.wikimedia.org/r/1214143
[21:03:35] sounds good! deploying now
[21:03:39] (PS3) Aaron Schulz: Remove /data-parsoid/ endpoint per T393557 [mediawiki-config] - https://gerrit.wikimedia.org/r/1214143 (https://phabricator.wikimedia.org/T411517)
[21:03:51] RoanKattouw: Thanks! I can self-deploy after katherine_g.
[21:03:53] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added gerrit-lb VIP - IPv6 - for drmrs, eqsin and esams T365259 - dzahn@cumin2002"
[21:03:58] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added gerrit-lb VIP - IPv6 - for drmrs, eqsin and esams T365259 - dzahn@cumin2002"
[21:03:59] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:04:37] danisztls: Great! Please ping me when you're done then, and I can go last
[21:04:39] RoanKattouw: it looks like I can no longer deploy do you mind deploying mine?
[21:04:51] Sure np
[21:05:17] ty!
[21:06:22] Oh I think the issue is that kostajh is currently doing a deployment, which is moving very slowly (it's been running for 36 minutes with no log output for the past 35 minutes) [21:06:33] sorry, let me finish up [21:06:47] although, the container images are still building [21:06:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:06:55] ah.. because of the i18n changes [21:06:57] sorry :/ [21:06:59] Yeah it's been stuck on that step for more than half an hour [21:07:00] should be done soon-is [21:07:01] Ah I see [21:07:04] *soon-ish [21:07:11] ok, i'll try again in a few min [21:07:14] Yeah those are always slow [21:07:40] katherine_g: You should see logmsgbot post here when that deploy is done, and you can also watch its progress at https://spiderpig.wikimedia.org/jobs/1025 [21:09:14] (03PS3) 10LorenMora: [Legal Footer] Create config for adding legal footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) [21:10:01] (03CR) 10CI reject: [V:04-1] [Legal Footer] Create config for adding legal footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: 10LorenMora) [21:10:22] (03CR) 10Aude: [Legal Footer] Create config for adding legal footer (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: 10LorenMora) [21:10:43] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [21:11:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:11:54] ok, it's moving forward now [21:13:57] (03PS4) 10LorenMora: [Legal Footer] Create config for adding legal footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) [21:14:10] 10ops-codfw, 10ops-eqiad, 06DC-Ops: Offline Script not completing - https://phabricator.wikimedia.org/T411551 (10Jhancock.wm) 03NEW [21:14:33] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added gerrit-lb VIP - IPv6 - for ulsfo and magru T365259 - dzahn@cumin2002" [21:14:39] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added gerrit-lb VIP - IPv6 - for ulsfo and magru T365259 - dzahn@cumin2002" [21:14:39] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:14:41] (03CR) 10LorenMora: [Legal Footer] Create config for adding legal footer (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: 10LorenMora) [21:14:48] (03CR) 10CI reject: [V:04-1] [Legal Footer] Create config for adding legal footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: 10LorenMora) [21:15:55] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1214101|Refactor: Move editing session ID logic into service (T406865)]], [[gerrit:1214102|hCaptcha: Log diff when challenge is presented (T406865)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:15:57] (CR) BPirkle: [C:+1] Update Math API title and project-specific /math/ endpoint stability policy [mediawiki-config] - https://gerrit.wikimedia.org/r/1214142 (https://phabricator.wikimedia.org/T411517) (owner: Aaron Schulz)
[21:15:58] T406865: hCaptcha: Implement mechanism to log about-to-be-published content when challenge is presented - https://phabricator.wikimedia.org/T406865
[21:16:03] (PS5) LorenMora: [Legal Footer] Create config for adding legal footer [mediawiki-config] - https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163)
[21:16:38] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1061.eqiad.wmnet with reason: host reimage
[21:17:02] !log kharlan@deploy2002 kharlan: Continuing with sync
[21:17:21] ops-codfw, SRE, DC-Ops: codfw: cleanup Interface enabled but not connected alert - https://phabricator.wikimedia.org/T411195#11426042 (Jhancock.wm) i fixed the one in d5, but gonna physically inspect the ones in c8 cause of an overabundance of caution.
[21:17:25] ops-codfw, SRE, DC-Ops: codfw: cleanup Interface enabled but not connected alert - https://phabricator.wikimedia.org/T411195#11426043 (Jhancock.wm) a:Jhancock.wm
[21:19:43] (CR) Aude: [C:+1] "looks good, tested this locally with the changes" [mediawiki-config] - https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: LorenMora)
[21:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[21:20:17] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1061.eqiad.wmnet with reason: host reimage
[21:20:28] (PS1) Krinkle: varnish: De-duplicate mediawiki::errorpage options and add missing docs [puppet] - https://gerrit.wikimedia.org/r/1214155
[21:20:28] (PS1) Krinkle: varnish: Move error message from footer to body for HTTP 4xx responses [puppet] - https://gerrit.wikimedia.org/r/1214156
[21:20:36] ops-eqiad, SRE, DC-Ops: Reclaim components from decommed servers - https://phabricator.wikimedia.org/T411533#11426057 (wiki_willy) Swap out R430 spare drives with newer drives (1 for 1 swap), along with memory
[21:23:44] (PS2) Krinkle: varnish: Move error message from footer to body for HTTP 4xx responses [puppet] - https://gerrit.wikimedia.org/r/1214156 (https://phabricator.wikimedia.org/T411552)
[21:25:00] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[21:26:45] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage
(exit_code=0) for host cloudvirt1050.eqiad.wmnet with OS trixie [21:26:55] (03CR) 10BPirkle: "Is the data-parsoid response format still needed? I see it still present, but not referenced (at least, in the "default" spec, that's the " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214143 (https://phabricator.wikimedia.org/T411517) (owner: 10Aaron Schulz) [21:29:08] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214101|Refactor: Move editing session ID logic into service (T406865)]], [[gerrit:1214102|hCaptcha: Log diff when challenge is presented (T406865)]] (duration: 59m 06s) [21:29:11] T406865: hCaptcha: Implement mechanism to log about-to-be-published content when challenge is presented - https://phabricator.wikimedia.org/T406865 [21:29:39] ok, deploying mine now [21:30:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kgraessle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207932 (https://phabricator.wikimedia.org/T409438) (owner: 10Kgraessle) [21:31:14] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:31:17] (03Merged) 10jenkins-bot: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207932 (https://phabricator.wikimedia.org/T409438) (owner: 10Kgraessle) [21:31:48] !log kgraessle@deploy2002 Started scap sync-world: Backport for [[gerrit:1207932|Enable revertrisk filters in thwiki (T409438)]] [21:31:51] T409438: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438 [21:32:02] (03CR) 10Bking: [C:03+2] opensearch on k8s: add DC-specific records for opensearch-ipoid [dns] - 10https://gerrit.wikimedia.org/r/1213580 (https://phabricator.wikimedia.org/T410956) (owner: 10Bking) [21:32:31] thanks, sorry that took a while! 
[21:32:38] !log bking@dns1004 START - running authdns-update [21:32:53] kostajh: np [21:33:38] !log bking@dns1004 END - running authdns-update [21:34:26] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:34:53] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:35:00] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [21:36:03] !log kgraessle@deploy2002 kgraessle: Backport for [[gerrit:1207932|Enable revertrisk filters in thwiki (T409438)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:38:24] !log kgraessle@deploy2002 kgraessle: Continuing with sync [21:38:34] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2013'] [21:38:46] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['backup2013'] [21:39:47] (03PS3) 10Krinkle: varnish: Move error message from footer to body for HTTP 4xx responses [puppet] - 10https://gerrit.wikimedia.org/r/1214156 (https://phabricator.wikimedia.org/T401489) [21:42:22] !log kgraessle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207932|Enable revertrisk filters in thwiki (T409438)]] (duration: 10m 34s) [21:42:28] T409438: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438 [21:43:01] RoanKattouw: mine is finished [21:43:18] danisztls: Your turn [21:43:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214126 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [21:43:28] 
(03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213123 (https://phabricator.wikimedia.org/T410918) (owner: 10DDesouza) [21:43:33] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:43:41] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:44:01] RoanKattouw: ok [21:44:19] (03Merged) 10jenkins-bot: [beta] Undeploy experiment for 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214126 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [21:44:22] (03Merged) 10jenkins-bot: Deploy 2025 Global Readers Survey (non-enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213123 (https://phabricator.wikimedia.org/T410918) (owner: 10DDesouza) [21:44:24] PROBLEM - Host backup2013 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:55] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1214126|[beta] Undeploy experiment for 2025 Global Readers Survey (T410696)]], [[gerrit:1213123|Deploy 2025 Global Readers Survey (non-enwiki) (T410918)]] [21:45:00] T410696: Deploy enwiki edition of 2025 GRS - https://phabricator.wikimedia.org/T410696 [21:45:01] T410918: Deploy 2025 Global Readers Surveys (non-English) - https://phabricator.wikimedia.org/T410918 [21:47:17] !log dani@deploy2002 dani: Backport for [[gerrit:1214126|[beta] Undeploy experiment for 2025 Global Readers Survey (T410696)]], [[gerrit:1213123|Deploy 2025 Global Readers Survey (non-enwiki) (T410918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:51:16] !log dani@deploy2002 dani: Continuing with sync [21:52:16] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:52:17] (03CR) 10Aaron Schulz: "data-parsoid is mentioned in the /page/html/ endpoints though it's not used as a schema. I wasn't sure, but I guess it's fine to remove. I" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214143 (https://phabricator.wikimedia.org/T411517) (owner: 10Aaron Schulz) [21:54:38] RECOVERY - Host backup2013 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms [21:55:19] !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214126|[beta] Undeploy experiment for 2025 Global Readers Survey (T410696)]], [[gerrit:1213123|Deploy 2025 Global Readers Survey (non-enwiki) (T410918)]] (duration: 10m 23s) [21:55:24] T410696: Deploy enwiki edition of 2025 GRS - https://phabricator.wikimedia.org/T410696 [21:55:25] T410918: Deploy 2025 Global Readers Surveys (non-English) - https://phabricator.wikimedia.org/T410918 [21:55:37] RoanKattouw: your turn [21:57:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214146 (https://phabricator.wikimedia.org/T385310) (owner: 10Bartosz Dziewoński) [21:57:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214147 (https://phabricator.wikimedia.org/T385310) (owner: 10Bartosz Dziewoński) [21:59:18] (03CR) 10JHathaway: nftables::service: Improve src/dst filter handling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [21:59:43] (03CR) 10Catrope: [C:04-1] OATHAuth: Expand 2FA to all users (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1213585 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251202T2200) [22:01:10] (03Merged) 10jenkins-bot: CentralAuthUser: Add debugging information for T385310 [extensions/CentralAuth] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214146 (https://phabricator.wikimedia.org/T385310) (owner: 10Bartosz Dziewoński) [22:01:11] (03Merged) 10jenkins-bot: CentralAuthUser: Add debugging information for T385310 [extensions/CentralAuth] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214147 (https://phabricator.wikimedia.org/T385310) (owner: 10Bartosz Dziewoński) [22:01:48] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1214146|CentralAuthUser: Add debugging information for T385310 (T385310)]], [[gerrit:1214147|CentralAuthUser: Add debugging information for T385310 (T385310)]] [22:01:51] T385310: Could not find local user data for {username}@{wikiId} (2025) - https://phabricator.wikimedia.org/T385310 [22:02:06] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:02:06] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:04:31] !log catrope@deploy2002 catrope, matmarex: Backport for [[gerrit:1214146|CentralAuthUser: Add debugging information for T385310 (T385310)]], [[gerrit:1214147|CentralAuthUser: Add debugging information for T385310 (T385310)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:04:39] MatmaRex: Please tet [22:04:41] *test [22:05:02] nothing to test on mwdebug, this change only adds some logging for production [22:05:11] OK then I will proceed [22:05:14] thanks for deploying btw RoanKattouw :) [22:05:17] !log catrope@deploy2002 catrope, matmarex: Continuing with sync [22:05:44] (03CR) 10CDanis: [C:03+1] UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [22:06:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86329 and previous config saved to /var/cache/conftool/dbconfig/20251202-220600-marostegui.json [22:06:05] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [22:06:06] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [22:07:53] (03PS4) 10Aaron Schulz: Remove /data-parsoid/ endpoint per T393557 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214143 (https://phabricator.wikimedia.org/T411517) [22:09:17] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214146|CentralAuthUser: Add debugging information for T385310 (T385310)]], [[gerrit:1214147|CentralAuthUser: Add debugging information for T385310 (T385310)]] (duration: 07m 29s) [22:09:20] T385310: Could not find local user data for {username}@{wikiId} (2025) - https://phabricator.wikimedia.org/T385310 [22:09:27] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1054.eqiad.wmnet with OS trixie [22:10:06] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:11:56] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:11:56] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:17:02] 10ops-codfw, 06SRE, 06DC-Ops: BIOS upgrade for backup2013 & backup2014 - https://phabricator.wikimedia.org/T411511#11426255 (10Jhancock.wm) [22:18:11] 10ops-codfw, 06SRE, 06DC-Ops: BIOS upgrade for backup2013 & backup2014 - https://phabricator.wikimedia.org/T411511#11426260 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm @jcrespo these two servers have been had their bios upgraded to 2.7.5. please let us know if you have any other issues with t... [22:20:34] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1061.eqiad.wmnet with OS trixie [22:21:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P86330 and previous config saved to /var/cache/conftool/dbconfig/20251202-222107-marostegui.json [22:23:47] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudrabbit1003.eqiad.wmnet with OS trixie [22:25:13] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudrabbit1001.eqiad.wmnet with OS trixie [22:32:27] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [22:33:08] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[22:36:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P86331 and previous config saved to /var/cache/conftool/dbconfig/20251202-223615-marostegui.json [22:38:54] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1003.eqiad.wmnet with reason: host reimage [22:39:55] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1001.eqiad.wmnet with reason: host reimage [22:40:16] cloudrabbit is a cool hostname [22:41:56] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [22:42:11] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1003.eqiad.wmnet with reason: host reimage [22:45:10] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1001.eqiad.wmnet with reason: host reimage [22:45:18] (03CR) 10BPirkle: [C:03+1] Remove /data-parsoid/ endpoint per T393557 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214143 (https://phabricator.wikimedia.org/T411517) (owner: 10Aaron Schulz) [22:47:36] dzahn@cumin2002 netbox (PID 1543985) is awaiting input [22:50:16] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:50:20] (03PS4) 10RLazarus: aux_k8s: Write Envoy hieradata to YAML files for sophroid [puppet] - 10https://gerrit.wikimedia.org/r/1213604 [22:51:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86332 and previous config saved to /var/cache/conftool/dbconfig/20251202-225122-marostegui.json [22:51:28] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [22:51:29] T411164: Drop rev_sha1 from revision table in wmf production - 
https://phabricator.wikimedia.org/T411164 [22:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:57:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214036 (https://phabricator.wikimedia.org/T408842) (owner: 10Sbisson) [22:58:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86333 and previous config saved to /var/cache/conftool/dbconfig/20251202-225809-marostegui.json [22:58:21] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [22:58:22] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [23:00:37] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1003.eqiad.wmnet with OS trixie [23:01:56] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudrabbit1002.eqiad.wmnet with OS trixie [23:02:06] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:02:06] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:02:25] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1001.eqiad.wmnet with OS trixie [23:05:00] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request 
- https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [23:05:34] (03PS1) 10JHathaway: admin: add fido backed ssh keys for jhathaway [puppet] - 10https://gerrit.wikimedia.org/r/1214169 [23:08:06] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:11:56] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:11:56] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:13:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P86334 and previous config saved to /var/cache/conftool/dbconfig/20251202-231317-marostegui.json [23:13:30] (03CR) 10VolkerE: [C:03+1] [Legal Footer] Create config for adding legal footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: 10LorenMora) [23:15:00] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [23:17:32] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1002.eqiad.wmnet with reason: host reimage [23:18:10] (03PS2) 10Zabe: Close crwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214056 (https://phabricator.wikimedia.org/T411501) [23:18:40] (03PS3) 10Zabe: Close klwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214057 
(https://phabricator.wikimedia.org/T411501) [23:22:57] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: move IPv6 gerrit-lb to IPs ending in ::2 T365259 - dzahn@cumin2002" [23:23:02] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: move IPv6 gerrit-lb to IPs ending in ::2 T365259 - dzahn@cumin2002" [23:23:03] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:23:39] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1002.eqiad.wmnet with reason: host reimage [23:24:55] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11426444 (10Papaul) @ssingh yes we have to depool the site, yes 10 AM CT [23:28:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P86335 and previous config saved to /var/cache/conftool/dbconfig/20251202-232824-marostegui.json [23:29:30] (03Abandoned) 10Aleksandar Mastilovic: Add GRANT MODIFYs to aqsloader for two new pageviews tables [puppet] - 10https://gerrit.wikimedia.org/r/1213571 (https://phabricator.wikimedia.org/T410962) (owner: 10Aleksandar Mastilovic) [23:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:41:15] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1002.eqiad.wmnet with OS trixie [23:41:40] (03PS1) 10Dzahn: geo-resources: add gerrit-addrs resource [dns] - 10https://gerrit.wikimedia.org/r/1214177 
(https://phabricator.wikimedia.org/T365259)
[23:43:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86336 and previous config saved to /var/cache/conftool/dbconfig/20251202-234332-marostegui.json
[23:43:40] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[23:43:41] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[23:43:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance
[23:43:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86337 and previous config saved to /var/cache/conftool/dbconfig/20251202-234356-marostegui.json
[23:45:14] (CR) Dzahn: "just asking for a sanity check, but separately for a deploy.. if anyone feels like just deploying it too, don't hold back." [dns] - https://gerrit.wikimedia.org/r/1214177 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[23:47:08] (PS1) Zabe: Update composer dependencies [mediawiki-config] - https://gerrit.wikimedia.org/r/1214178
[23:53:27] (PS1) Dzahn: dns.admin: add gerrit-addrs resource [cookbooks] - https://gerrit.wikimedia.org/r/1214179 (https://phabricator.wikimedia.org/T365259)
[23:58:38] jouncebot: nowandnext
[23:58:38] No deployments scheduled for the next 7 hour(s) and 1 minute(s)
[23:58:38] In 7 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T0700)
[23:58:50] (CR) Zabe: [C:+2] Update composer dependencies [mediawiki-config] - https://gerrit.wikimedia.org/r/1214178 (owner: Zabe)
[23:59:37] (Merged) jenkins-bot: Update composer dependencies [mediawiki-config] - https://gerrit.wikimedia.org/r/1214178 (owner: Zabe)
[23:59:54] (CR) Zabe: [C:+2] Close crwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1214056 (https://phabricator.wikimedia.org/T411501) (owner: Zabe)