[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T0000) [00:02:25] (03PS1) 10Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1228603 [00:02:31] (03PS1) 10Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1228604 [00:02:37] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1228605 [00:05:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:15:55] PROBLEM - MegaRAID on db1171 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:15:56] ACKNOWLEDGEMENT - MegaRAID on db1171 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T415001 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:16:10] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T415001 (10ops-monitoring-bot) 03NEW [00:30:29] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1187 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [00:30:31] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1187 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler 
auto-ack: https://phabricator.wikimedia.org/T415002 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [00:30:39] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002 (10ops-monitoring-bot) 03NEW [00:39:59] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1228606 [00:39:59] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1228606 (owner: 10TrainBranchBot) [00:51:36] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1228606 (owner: 10TrainBranchBot) [00:54:12] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:00:38] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:01:41] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 01m 03s) [01:10:15] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1228607 [01:10:15] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1228607 (owner: 10TrainBranchBot) [01:19:12] FIRING: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:32:51] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1228607 (owner: 10TrainBranchBot) [01:33:13] PROBLEM - dump of matomo in eqiad on backupmon1001 is CRITICAL: Last dump for matomo at eqiad (db1208) taken on 2026-01-20 01:08:03 is 506 MiB, but the previous one was 415 MiB, a change of +22.0 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:01:15] PROBLEM - dump of s1 in codfw on backupmon1001 is CRITICAL: Last dump for s1 at codfw (db2141) taken on 2026-01-20 00:00:02 is 153 GiB, but the previous one was 183 GiB, a change of -16.5 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:09:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.46.0-wmf.12 [core] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1228627 (https://phabricator.wikimedia.org/T413803) [02:09:41] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.46.0-wmf.12 [core] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1228627 (https://phabricator.wikimedia.org/T413803) (owner: 10TrainBranchBot) [02:21:23] (03Merged) 10jenkins-bot: Branch commit for wmf/1.46.0-wmf.12 [core] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1228627 (https://phabricator.wikimedia.org/T413803) (owner: 10TrainBranchBot) [02:30:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T410589)', diff saved to https://phabricator.wikimedia.org/P87766 and previous config saved to /var/cache/conftool/dbconfig/20260120-023047-ladsgroup.json [02:30:55] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [02:40:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling 
after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P87767 and previous config saved to /var/cache/conftool/dbconfig/20260120-024056-ladsgroup.json [02:51:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P87768 and previous config saved to /var/cache/conftool/dbconfig/20260120-025103-ladsgroup.json [03:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T0300) [03:01:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T410589)', diff saved to https://phabricator.wikimedia.org/P87769 and previous config saved to /var/cache/conftool/dbconfig/20260120-030112-ladsgroup.json [03:01:17] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [03:01:28] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [03:01:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2175 (T410589)', diff saved to https://phabricator.wikimedia.org/P87770 and previous config saved to /var/cache/conftool/dbconfig/20260120-030136-ladsgroup.json [03:09:55] PROBLEM - Host an-druid1005 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 3014.22 ms [03:10:15] RECOVERY - Host an-druid1005 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [03:19:12] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T0400) [04:00:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87771 and previous config saved to /var/cache/conftool/dbconfig/20260120-040055-marostegui.json [04:01:04] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [04:01:04] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [04:02:00] (03PS1) 10TrainBranchBot: testwikis to 1.46.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1228694 (https://phabricator.wikimedia.org/T413803) [04:02:03] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1228694 (https://phabricator.wikimedia.org/T413803) (owner: 10TrainBranchBot) [04:02:58] (03Merged) 10jenkins-bot: testwikis to 1.46.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1228694 (https://phabricator.wikimedia.org/T413803) (owner: 10TrainBranchBot) [04:03:27] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.12 refs T413803 [04:03:32] T413803: 1.46.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T413803 [04:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:11:05] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P87772 and previous config saved to /var/cache/conftool/dbconfig/20260120-041104-marostegui.json [04:21:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P87773 and previous config saved to /var/cache/conftool/dbconfig/20260120-042112-marostegui.json [04:31:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87774 and previous config saved to /var/cache/conftool/dbconfig/20260120-043120-marostegui.json [04:31:27] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [04:31:27] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [04:31:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [04:31:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1219 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87775 and previous config saved to /var/cache/conftool/dbconfig/20260120-043145-marostegui.json [04:47:42] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.12 refs T413803 (duration: 44m 14s) [04:47:46] T413803: 1.46.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T413803 [04:54:12] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:00:05] Deploy window Automatic removal of all obsolete 
MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T0500) [05:03:25] !log mwpresync@deploy2002 Pruned MediaWiki: 1.46.0-wmf.7 (duration: 03m 23s) [05:09:12] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:19:13] FIRING: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:34:12] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:18:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2240 with weight 0 T414543', diff saved to https://phabricator.wikimedia.org/P87776 and previous config saved to /var/cache/conftool/dbconfig/20260120-061840-marostegui.json [06:18:45] T414543: Switchover s4 master (db2179 -> db2240) - https://phabricator.wikimedia.org/T414543 [06:19:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 42 hosts with reason: Primary switchover s4 T414543 [06:19:45] (03CR) 10Marostegui: [C:03+2] 
mariadb: Promote db2240 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1226510 (https://phabricator.wikimedia.org/T414543) (owner: 10Gerrit maintenance bot) [06:25:14] !log Starting s4 codfw failover from db2179 to db2240 - T414543 [06:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:18] T414543: Switchover s4 master (db2179 -> db2240) - https://phabricator.wikimedia.org/T414543 [06:25:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s4 codfw as read-only for maintenance - T414543', diff saved to https://phabricator.wikimedia.org/P87777 and previous config saved to /var/cache/conftool/dbconfig/20260120-062527-marostegui.json [06:25:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2240 to s4 primary and set section read-write T414543', diff saved to https://phabricator.wikimedia.org/P87778 and previous config saved to /var/cache/conftool/dbconfig/20260120-062551-marostegui.json [06:26:12] (03CR) 10Marostegui: [C:03+2] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1226512 (https://phabricator.wikimedia.org/T414543) (owner: 10Gerrit maintenance bot) [06:26:16] (03CR) 10Ayounsi: [C:03+1] plugins/wmf-netbox: remove ipv4 only for DNS hosts BGP [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1228518 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [06:26:18] !log marostegui@dns1006 START - running authdns-update [06:26:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2179 T414543', diff saved to https://phabricator.wikimedia.org/P87779 and previous config saved to /var/cache/conftool/dbconfig/20260120-062653-marostegui.json [06:27:25] !log marostegui@dns1006 END - running authdns-update [06:27:27] (03CR) 10Ayounsi: [C:03+1] dnsbox: advertise ns[0-2] IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [06:29:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [06:30:39] FIRING: TransitBGPDown: Transit BGP session down between cr2-eqdfw and Hurricane Electric (2001:504:0:5::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [06:30:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T415001#11535590 (10Marostegui) @jcrespo this is a backup source. [06:32:02] (03CR) 10Marostegui: [C:03+2] production-m5.sql.erb: Remove old mwmaint IP [puppet] - 10https://gerrit.wikimedia.org/r/1228507 (https://phabricator.wikimedia.org/T397017) (owner: 10Marostegui) [06:33:04] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11535605 (10ayounsi) A maybe safer alternative is to first enable IPv6 BGP peering between the network and all the dnsbox with `profile::bird::do_ipv6: true` (and the Homer patches). BGP over v6 w... [06:33:27] (03CR) 10Marostegui: [C:03+2] "Grants removed from production too." 
[puppet] - 10https://gerrit.wikimedia.org/r/1228507 (https://phabricator.wikimedia.org/T397017) (owner: 10Marostegui) [06:37:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1160.eqiad.wmnet with reason: Schema change [06:38:17] (03CR) 10Ayounsi: [C:03+1] Add config for authdns IPv6 public IPs (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1228576 (https://phabricator.wikimedia.org/T81605) (owner: 10Cathal Mooney) [06:38:23] (03PS1) 10Marostegui: db1160: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1228992 [06:44:16] (03CR) 10Marostegui: [C:03+2] db1160: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1228992 (owner: 10Marostegui) [06:46:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [06:46:51] (03CR) 10Clément Goubert: ratelimit-media: Initial service deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226814 (https://phabricator.wikimedia.org/T414439) (owner: 10Clément Goubert) [06:47:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [06:47:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1167 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87780 and previous config saved to /var/cache/conftool/dbconfig/20260120-064708-marostegui.json [06:47:16] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [06:47:17] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [06:47:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance 
[06:48:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2152 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87781 and previous config saved to /var/cache/conftool/dbconfig/20260120-064801-marostegui.json [06:49:00] (03PS4) 10Clément Goubert: Add ratelimit-upload namespace to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226798 (https://phabricator.wikimedia.org/T414439) [06:49:00] (03PS4) 10Clément Goubert: ratelimit-media: Initial service deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226814 (https://phabricator.wikimedia.org/T414439) [06:49:23] (03PS5) 10Clément Goubert: Add ratelimit-media namespace to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226798 (https://phabricator.wikimedia.org/T414439) [06:49:38] !log Deploy schema change on s8 sanitarium master - s8 wikireplicas will be lagging for many hours T411164 T411163 [06:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:26] (03PS1) 10Kosta Harlan: Hooks: Log the security log context for edit errors [extensions/WikiEditor] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1228993 (https://phabricator.wikimedia.org/T410877) [06:56:53] (03PS1) 10Marostegui: db2179: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1228994 [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T0700) [07:00:05] marostegui, Amir1, and federico3: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T0700). 
[07:05:39] RESOLVED: TransitBGPDown: Transit BGP session down between cr2-eqdfw and Hurricane Electric (2001:504:0:5::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDow [07:06:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikiEditor] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1228993 (https://phabricator.wikimedia.org/T410877) (owner: 10Kosta Harlan) [07:06:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223636 (https://phabricator.wikimedia.org/T410615) (owner: 10Kosta Harlan) [07:07:23] (03PS1) 10Ayounsi: magru geo-maps: match DNS discovery records [dns] - 10https://gerrit.wikimedia.org/r/1228996 (https://phabricator.wikimedia.org/T411617) [07:13:18] (03PS1) 10Kevin Bazira: ml-services: bump rr-wikidata limitranges and resourcequota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228997 (https://phabricator.wikimedia.org/T414060) [07:19:12] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:19:34] (03CR) 10CI reject: [V:04-1] ml-services: bump rr-wikidata limitranges and resourcequota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228997 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) 
[07:22:52] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:24:00] (03CR) 10Clément Goubert: [C:03+2] Add ratelimit-media namespace to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226798 (https://phabricator.wikimedia.org/T414439) (owner: 10Clément Goubert) [07:31:15] (03PS2) 10Kevin Bazira: ml-services: bump rr-wikidata limitranges and resourcequota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228997 (https://phabricator.wikimedia.org/T414060) [07:31:41] (03Merged) 10jenkins-bot: Add ratelimit-media namespace to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226798 (https://phabricator.wikimedia.org/T414439) (owner: 10Clément Goubert) [07:38:10] (03CR) 10CI reject: [V:04-1] ml-services: bump rr-wikidata limitranges and resourcequota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228997 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [07:38:22] (03PS3) 10Daniel Kinzler: rest gateway: include a meaningful body with 429 responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226827 (https://phabricator.wikimedia.org/T405636) [07:38:31] (03CR) 10Daniel Kinzler: "I wanted to stay consistent with the errors we return from the edge: https://gerrit.wikimedia.org/g/operations/puppet/+/6ed5197b084ee17b0e" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226827 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [07:41:14] (03CR) 10Muehlenhoff: [C:03+2] DNS: Enable Bird 2.18 for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1228559 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [07:41:26] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[07:42:30] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [07:42:50] !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [07:42:52] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:45:36] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [07:46:38] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [07:47:52] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:48:49] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [07:49:05] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [07:50:25] !log installing unbound security updates [07:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:10] (03PS3) 10Kevin Bazira: ml-services: bump rr-wikidata limitranges and resourcequota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228997 (https://phabricator.wikimedia.org/T414060) [07:52:52] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:53:30] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[07:57:43] (03PS1) 10Bartosz Wójtowicz: ml-services: Enable multiple workers for revise tone service. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229059 (https://phabricator.wikimedia.org/T411758) [08:00:05] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T0800). [08:00:05] kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:02:07] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Enable multiple workers for revise tone service. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229059 (https://phabricator.wikimedia.org/T411758) (owner: 10Bartosz Wójtowicz) [08:05:00] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Enable multiple workers for revise tone service. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229059 (https://phabricator.wikimedia.org/T411758) (owner: 10Bartosz Wójtowicz) [08:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:52] (03Merged) 10jenkins-bot: ml-services: Enable multiple workers for revise tone service. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229059 (https://phabricator.wikimedia.org/T411758) (owner: 10Bartosz Wójtowicz) [08:08:38] !log bwojtowicz@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . 
[08:09:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87783 and previous config saved to /var/cache/conftool/dbconfig/20260120-080935-marostegui.json [08:09:42] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [08:09:42] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [08:10:04] !log bwojtowicz@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [08:11:22] I’ll be around a bit later in this window to backport my patches [08:15:50] (03CR) 10Brouberol: Define the test-kitchen service (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228531 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [08:18:01] (03CR) 10Slyngshede: [C:03+2] Release version 0.1.14 [software/bitu] - 10https://gerrit.wikimedia.org/r/1228520 (owner: 10Slyngshede) [08:19:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P87784 and previous config saved to /var/cache/conftool/dbconfig/20260120-081944-marostegui.json [08:20:31] (03Merged) 10jenkins-bot: Release version 0.1.14 [software/bitu] - 10https://gerrit.wikimedia.org/r/1228520 (owner: 10Slyngshede) [08:26:00] git st [08:26:05] git st [08:26:08] git st [08:26:17] git add modules/envoyproxy/manifests/tls_terminator.pp modules/envoyproxy/templates/tls_terminator/listener.yaml.erb modules/profile/manifests/tlsproxy/envoy.pp [08:26:19] git commit -a --amend [08:26:30] git sto- Add ratelimit lua script [08:29:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P87785 and previous config saved to /var/cache/conftool/dbconfig/20260120-082952-marostegui.json 
[08:31:01] ok, getting started with my patches [08:31:11] claime: wrong terminal tab? [08:31:42] kostajh: jesus... somehow my terminator switched to "All terminal" key send [08:31:48] I'm sorry about that [08:31:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223636 (https://phabricator.wikimedia.org/T410615) (owner: 10Kosta Harlan) [08:32:06] At least it was just commands... [08:32:57] (03Merged) 10jenkins-bot: IPReputation: Enable OpenSearch IPoid provider on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223636 (https://phabricator.wikimedia.org/T410615) (owner: 10Kosta Harlan) [08:33:46] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1223636|IPReputation: Enable OpenSearch IPoid provider on testwiki (T410615)]] [08:33:51] T410615: Update Extension:IPReputation to support OpenSearch - https://phabricator.wikimedia.org/T410615 [08:36:02] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1223636|IPReputation: Enable OpenSearch IPoid provider on testwiki (T410615)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:37:03] (03CR) 10Dpogorzelski: [C:03+1] ml-services: bump rr-wikidata limitranges and resourcequota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228997 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [08:37:29] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1228549 (owner: 10Dpogorzelski) [08:37:56] (03PS1) 10Slyngshede: idm: move to Bitu 0.1.15 [dns] - 10https://gerrit.wikimedia.org/r/1229065 [08:39:18] (03CR) 10Dpogorzelski: [C:03+2] ml-services: bump rr-wikidata limitranges and resourcequota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228997 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [08:39:29] !log kharlan@deploy2002 kharlan: Continuing with sync [08:40:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87786 and previous config saved to /var/cache/conftool/dbconfig/20260120-084001-marostegui.json [08:40:07] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [08:40:08] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [08:40:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2202.codfw.wmnet with reason: Maintenance [08:41:40] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-commons-local-public.0c in codfw [08:43:58] claime: no worries :) [08:44:01] (03CR) 10Effie Mouzeli: [C:03+1] partman: New mc nodes need UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1220373 (https://phabricator.wikimedia.org/T412255) (owner: 10Clément Goubert) [08:44:06] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-commons-local-public.0c in codfw [08:44:19] (03CR) 10Effie Mouzeli: [C:03+1] site.pp: 
Add mc1055-72 [puppet] - 10https://gerrit.wikimedia.org/r/1226782 (https://phabricator.wikimedia.org/T412255) (owner: 10Clément Goubert) [08:45:30] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1223636|IPReputation: Enable OpenSearch IPoid provider on testwiki (T410615)]] (duration: 11m 43s) [08:45:35] T410615: Update Extension:IPReputation to support OpenSearch - https://phabricator.wikimedia.org/T410615 [08:45:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikiEditor] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1228993 (https://phabricator.wikimedia.org/T410877) (owner: 10Kosta Harlan) [08:46:36] (03CR) 10JMeybohm: [C:03+1] "Hm, bummer :/" [puppet] - 10https://gerrit.wikimedia.org/r/1228549 (owner: 10Dpogorzelski) [08:46:47] (03Merged) 10jenkins-bot: ml-services: bump rr-wikidata limitranges and resourcequota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228997 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [08:47:41] (03Merged) 10jenkins-bot: Hooks: Log the security log context for edit errors [extensions/WikiEditor] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1228993 (https://phabricator.wikimedia.org/T410877) (owner: 10Kosta Harlan) [08:48:13] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1228993|Hooks: Log the security log context for edit errors (T410877)]] [08:48:17] T410877: WikiEditor: Log unknown codes to Logstash - https://phabricator.wikimedia.org/T410877 [08:50:17] (03CR) 10Joal: [C:03+1] Exclude old an-worker hosts from HDFS and YARN [puppet] - 10https://gerrit.wikimedia.org/r/1228479 (https://phabricator.wikimedia.org/T414948) (owner: 10Btullis) [08:50:19] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1228993|Hooks: Log the security log context for edit errors (T410877)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:51:17] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:51:36] !log kharlan@deploy2002 kharlan: Continuing with sync [08:52:58] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [08:53:32] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [08:53:57] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1229065 (owner: 10Slyngshede) [08:54:12] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:54:17] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [08:54:52] (03CR) 10Brouberol: [C:04-1] "That won't work because the services you've added are not defined in [puppet](https://gerrit.wikimedia.org/r/plugins/gitiles/operations/pu" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228599 (https://phabricator.wikimedia.org/T411989) (owner: 10Aqu) [08:54:58] (03CR) 10Clément Goubert: [C:03+2] partman: New mc nodes need UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1220373 (https://phabricator.wikimedia.org/T412255) (owner: 10Clément Goubert) [08:55:09] (03CR) 10Clément Goubert: [C:03+2] site.pp: Add mc1055-72 [puppet] - 10https://gerrit.wikimedia.org/r/1226782 (https://phabricator.wikimedia.org/T412255) (owner: 10Clément Goubert) [08:55:11] (03CR) 10Slyngshede: [C:03+2] idm: move to Bitu 0.1.15 [dns] - 10https://gerrit.wikimedia.org/r/1229065 (owner: 10Slyngshede) [08:55:28] !log slyngshede@dns1004 START - running authdns-update [08:55:33] !log slyngshede@dns1004 START - running authdns-update [08:55:37] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1228993|Hooks: Log the security 
log context for edit errors (T410877)]] (duration: 07m 24s) [08:55:42] T410877: WikiEditor: Log unknown codes to Logstash - https://phabricator.wikimedia.org/T410877 [08:56:44] !log slyngshede@dns1004 END - running authdns-update [08:59:41] (03PS3) 10Vgutierrez: cache::upload: enable global ratelimiting (magru) [puppet] - 10https://gerrit.wikimedia.org/r/1228563 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [08:59:49] (03CR) 10Vgutierrez: cache::upload: enable global ratelimiting (magru) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1228563 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:00:05] andre and jeena: That opportune time for a MediaWiki train - Utc-0+Utc-7 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T0900). [09:06:40] (03PS1) 10Muehlenhoff: Simplify partman config for maps [puppet] - 10https://gerrit.wikimedia.org/r/1229067 [09:07:30] !log dpogorzelski@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on registry[2004-2005].codfw.wmnet,registry[1004-1005].eqiad.wmnet with reason: testing ml changes [09:07:46] (03CR) 10Dpogorzelski: [C:03+2] docker registry: Add ml-build user to regular push [puppet] - 10https://gerrit.wikimedia.org/r/1228549 (owner: 10Dpogorzelski) [09:10:31] 10SRE-SLO, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q3): Thanos (store|query-frontend) memcached cache in bad status - https://phabricator.wikimedia.org/T411273#11535873 (10tappof) I’ve just opened an issue upstream: https://github.com/thanos-io/thanos/issues/8641 [09:12:37] (03PS1) 10Dpogorzelski: docker-registry: fix variable declaration [puppet] - 10https://gerrit.wikimedia.org/r/1229068 [09:14:01] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229069 (https://phabricator.wikimedia.org/T413803) [09:14:05] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by 
aklapper@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229069 (https://phabricator.wikimedia.org/T413803) (owner: 10TrainBranchBot) [09:15:06] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229069 (https://phabricator.wikimedia.org/T413803) (owner: 10TrainBranchBot) [09:16:01] (03CR) 10Santiago Faci: [C:03+1] Define the test-kitchen service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228531 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [09:16:06] (03CR) 10Dpogorzelski: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1229068 (owner: 10Dpogorzelski) [09:17:30] (03PS1) 10Kevin Bazira: ml-services: rr-wikidata horizontal scaling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229071 (https://phabricator.wikimedia.org/T414060) [09:19:13] FIRING: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:19:45] (03PS6) 10Ayounsi: sre.hosts.provision: (Dell) disable LLDP on main Broadcom NIC [cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) [09:20:13] (03CR) 10Elukey: [C:03+1] docker-registry: fix variable declaration [puppet] - 10https://gerrit.wikimedia.org/r/1229068 (owner: 10Dpogorzelski) [09:20:20] (03CR) 10Vgutierrez: [C:03+2] cache::upload: enable global ratelimiting (magru) [puppet] - 10https://gerrit.wikimedia.org/r/1228563 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:20:29] (03CR) 10Ayounsi: sre.hosts.provision: (Dell) disable LLDP on main Broadcom NIC (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) (owner: 10Ayounsi) [09:20:32] (03PS2) 10Dpogorzelski: docker-registry: fix variable declaration [puppet] - 
10https://gerrit.wikimedia.org/r/1229068 [09:20:40] (03CR) 10Dpogorzelski: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1229068 (owner: 10Dpogorzelski) [09:21:07] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.12 refs T413803 [09:21:12] T413803: 1.46.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T413803 [09:21:34] (03CR) 10Elukey: [C:03+1] sre.hosts.provision: (Dell) disable LLDP on main Broadcom NIC (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) (owner: 10Ayounsi) [09:22:00] (03CR) 10Dpogorzelski: [C:03+2] docker-registry: fix variable declaration [puppet] - 10https://gerrit.wikimedia.org/r/1229068 (owner: 10Dpogorzelski) [09:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:27:27] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: PuppetConstantChange (instance cloudidp2001-dev:9100) - https://phabricator.wikimedia.org/T414968#11535929 (10SLyngshede-WMF) 05Open→03Resolved Leftovers from moving ldaptui to OS packages: Fix: ` $ cd /srv/ldaptui/ $ sudo rm -... 
[09:30:01] 10SRE-SLO, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q3), 07Upstream: Thanos (store|query-frontend) memcached cache in bad status - https://phabricator.wikimedia.org/T411273#11535949 (10Aklapper) [09:33:14] (03CR) 10Elukey: [C:03+1] Simplify partman config for maps [puppet] - 10https://gerrit.wikimedia.org/r/1229067 (owner: 10Muehlenhoff) [09:33:17] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11535957 (10cmooney) Yeah I was worried we'd see the same pattern as the graph in the task description... [09:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:35:31] !log ayounsi@cumin1003 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:36:26] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:38:45] (03CR) 10Ayounsi: [C:03+2] "Tested a last time on sretest2003" [cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) (owner: 10Ayounsi) [09:39:52] (03CR) 10Btullis: [V:03+1 C:03+2] Exclude old an-worker hosts from HDFS and YARN [puppet] - 10https://gerrit.wikimedia.org/r/1228479 (https://phabricator.wikimedia.org/T414948) (owner: 10Btullis) [09:40:39] (03PS3) 10Brouberol: global_config: add external-services for all eventgate LVS endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1229072 (https://phabricator.wikimedia.org/T411989) [09:40:46] !log 
dpogorzelski@cumin1003 START - Cookbook sre.hosts.remove-downtime for registry[2004-2005].codfw.wmnet,registry[1004-1005].eqiad.wmnet [09:40:49] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for registry[2004-2005].codfw.wmnet,registry[1004-1005].eqiad.wmnet [09:41:20] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229071 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [09:42:43] (03CR) 10Kevin Bazira: [C:03+2] ml-services: rr-wikidata horizontal scaling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229071 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [09:43:35] (03Merged) 10jenkins-bot: sre.hosts.provision: (Dell) disable LLDP on main Broadcom NIC [cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) (owner: 10Ayounsi) [09:44:29] (03Merged) 10jenkins-bot: ml-services: rr-wikidata horizontal scaling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229071 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [09:46:27] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:50:00] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[09:50:43] (03PS1) 10Muehlenhoff: Bump the 6.12 backport for Bookworm to 6.12.57 [puppet] - 10https://gerrit.wikimedia.org/r/1229074 (https://phabricator.wikimedia.org/T414460) [09:51:22] (03PS2) 10Muehlenhoff: Bump the 6.12 backport for Bookworm to 6.12.57 [puppet] - 10https://gerrit.wikimedia.org/r/1229074 (https://phabricator.wikimedia.org/T414460) [09:52:44] (03PS2) 10Marostegui: db2179: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1228994 [09:52:44] (03PS1) 10Marostegui: dbproxy1024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1229075 (https://phabricator.wikimedia.org/T414656) [09:53:17] (03CR) 10Marostegui: [C:03+2] db2179: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1228994 (owner: 10Marostegui) [09:53:34] (03PS1) 10Aklapper: admin: remove old ssh key of aklapper [puppet] - 10https://gerrit.wikimedia.org/r/1224584 (https://phabricator.wikimedia.org/T413009) [09:53:47] (03CR) 10Marostegui: [C:03+2] dbproxy1024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1229075 (https://phabricator.wikimedia.org/T414656) (owner: 10Marostegui) [09:54:09] (03CR) 10Btullis: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1229074 (https://phabricator.wikimedia.org/T414460) (owner: 10Muehlenhoff) [09:54:15] (03CR) 10Aklapper: "Marking as ready to go as train deployment with new key also worked as expected" [puppet] - 10https://gerrit.wikimedia.org/r/1224584 (https://phabricator.wikimedia.org/T413009) (owner: 10Aklapper) [09:54:56] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host dbproxy1024.eqiad.wmnet with OS trixie [09:57:12] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11536035 (10ayounsi) With that cookbook change merged, new Dell servers (or any that we use the provision cookbook on) will have their LLDP setting changed. W... 
[09:57:29] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11536036 (10ayounsi) 05Open→03Stalled a:05ayounsi→03None [09:57:30] 06SRE, 10homer, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Homer: Netbox driven switch interfaces - https://phabricator.wikimedia.org/T250429#11536039 (10ayounsi) [09:59:07] (03CR) 10Cathal Mooney: [C:03+2] Add config for authdns IPv6 public IPs [homer/public] - 10https://gerrit.wikimedia.org/r/1228576 (https://phabricator.wikimedia.org/T81605) (owner: 10Cathal Mooney) [10:00:25] (03Merged) 10jenkins-bot: Add config for authdns IPv6 public IPs [homer/public] - 10https://gerrit.wikimedia.org/r/1228576 (https://phabricator.wikimedia.org/T81605) (owner: 10Cathal Mooney) [10:05:11] (03CR) 10Muehlenhoff: [C:03+2] Bump the 6.12 backport for Bookworm to 6.12.57 [puppet] - 10https://gerrit.wikimedia.org/r/1229074 (https://phabricator.wikimedia.org/T414460) (owner: 10Muehlenhoff) [10:05:14] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11536050 (10MoritzMuehlenhoff) I suggest we first move to the latest 6.12 backport to rule that this isn't a... 
[10:08:15] !log installing postgresql-15 security updates [10:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:57] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dbproxy1024.eqiad.wmnet with OS trixie [10:17:25] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host dbproxy1024.eqiad.wmnet with OS trixie [10:19:31] (03PS3) 10Vgutierrez: cache::upload: enable global ratelimiting (ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/1228568 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:22:04] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Support listing pooled / active authdns hosts (rather than all) - https://phabricator.wikimedia.org/T375014#11536131 (10MLechvien-WMF) [10:22:25] (03CR) 10Milimetric: trafficserver: Send /ins-502b/v2/events to intake-analytics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) (owner: 10Milimetric) [10:23:20] !log Running `foreachwikiindblist group2.dblist extensions/CheckUser/maintenance/populateUserAgentTable.php --batch-size 500` for T413868 [10:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:25] T413868: Populate the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T413868 [10:23:57] (03CR) 10Vgutierrez: [C:03+2] trafficserver: Send /ins-502b/v2/events to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) (owner: 10Milimetric) [10:24:17] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Support listing pooled / active authdns hosts (rather than all) - https://phabricator.wikimedia.org/T375014#11536146 (10MLechvien-WMF) Hi @Volans can I confirm the status of this task? As noted in description this would be benefi... 
[10:24:25] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11536148 (10MoritzMuehlenhoff) >>! In T414460#11536050, @MoritzMuehlenhoff wrote: > I suggest we first move... [10:25:25] RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:27:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:28:19] 06SRE, 10SRE-Access-Requests: Requesting access to SRE/production access for Kim.pham (kimpham in phab) - https://phabricator.wikimedia.org/T414671#11536198 (10FCeratto-WMF) [10:28:20] ^^ download.microsoft.com has a broken TLS setup since last night and it's triggering some errors on dump_cloud_ip_ranges [10:28:54] (03CR) 10Vgutierrez: [C:03+1] "VTCs are happy" [puppet] - 10https://gerrit.wikimedia.org/r/1228568 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:31:04] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11536203 (10cmooney) >>! In T81605#11535605, @ayounsi wrote: > A maybe safer alternative is to first enable IPv6 BGP peering between the network and all the dnsbox with `profile::bird::do_ipv6: tr... 
[10:33:11] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1024.eqiad.wmnet with reason: host reimage [10:33:16] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T415001#11536220 (10jcrespo) ` 2026-01-19 23:54:25 The Physical Drive (PD) Disk 4 in Backplane 1 of Integrated RAID Controller 1 is not correctly functioning. Part n... [10:38:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1024.eqiad.wmnet with reason: host reimage [10:41:20] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T415001#11536288 (10jcrespo) Dc ops, could I request if you could find a replacement disk? [10:41:23] !log Restarted group1 run with `foreachwikiindblist group1.dblist extensions/CheckUser/maintenance/populateUserAgentTable.php --batch-size 1000 --sleep 1` for T413868 [10:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:27] T413868: Populate the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T413868 [10:42:06] !log Restarted group2 run with `foreachwikiindblist group2.dblist extensions/CheckUser/maintenance/populateUserAgentTable.php --batch-size 1000 --sleep 1` for T413868 [10:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:08] !log Stopped all maintenance script runs for T413868 [10:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:02] !log Running `foreachwikiindblist large.dblist extensions/CheckUser/maintenance/populateUserAgentTable.php --batch-size 1000 --sleep 1` for T413868 [10:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:24] !log Running `foreachwikiindblist small.dblist 
extensions/CheckUser/maintenance/populateUserAgentTable.php --batch-size 1000` for T413868 [10:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:28] phab seems to be having an outage [10:53:12] XioNoX: vgutierrez: for awareness, i'm either getting very slow responses or outright "upstream connect error or disconnect/reset before headers. reset reason: connection timeout: 0" [10:54:14] also phab based error messages "Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL)." [10:55:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:56:50] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06ServiceOps new, 07Datacenter-Switchover: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11536349 (10MLechvien-WMF) @Blake could you move this on the board if you plan to do it this quart... 
[11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T1100) [11:00:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:00:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1024.eqiad.wmnet with OS trixie [11:06:46] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11536381 (10BTullis) >>! In T414460#11536148, @MoritzMuehlenhoff wrote: >>>! In T414460#11536050, @Mor... [11:07:46] RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [11:09:16] (03PS1) 10Marostegui: Revert "dbproxy1024: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1229085 [11:15:32] !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker-eqiad [11:15:41] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11536410 (10ops-monitoring-bot) Roll-reboot of nodes in dse-eqiad cluster started by btullis: * dse-k8... [11:16:22] (03PS1) 10Bartosz Wójtowicz: ml-services: Update revise-tone-task-generator image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1229086 (https://phabricator.wikimedia.org/T408538) [11:17:32] (03PS1) 10Brouberol: trafficserver: remove mpic.w.o from the list of targets [puppet] - 10https://gerrit.wikimedia.org/r/1229087 (https://phabricator.wikimedia.org/T407808) [11:17:53] (03CR) 10Alexandros Kosiaris: "Good points, both checks removed." [puppet] - 10https://gerrit.wikimedia.org/r/1228580 (owner: 10Alexandros Kosiaris) [11:17:56] (03CR) 10Muehlenhoff: [C:03+2] admin: remove old ssh key of aklapper [puppet] - 10https://gerrit.wikimedia.org/r/1224584 (https://phabricator.wikimedia.org/T413009) (owner: 10Aklapper) [11:18:49] (03PS2) 10Alexandros Kosiaris: Remove kernelversion check for BBR and unprivileged BPF [puppet] - 10https://gerrit.wikimedia.org/r/1228580 [11:18:53] (03PS2) 10Alexandros Kosiaris: profile::base: Remove a superfluous $::site check [puppet] - 10https://gerrit.wikimedia.org/r/1228581 [11:18:57] (03PS4) 10Alexandros Kosiaris: base::sysctl: Allow more finegrained rp_filter behavior [puppet] - 10https://gerrit.wikimedia.org/r/1228582 (https://phabricator.wikimedia.org/T352956) [11:19:01] (03PS4) 10Alexandros Kosiaris: base::sysctl: Switch priority of the ubuntu-defaults stanza [puppet] - 10https://gerrit.wikimedia.org/r/1228583 (https://phabricator.wikimedia.org/T352956) [11:19:12] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:24:24] (03CR) 10Santiago Faci: [C:03+1] trafficserver: remove mpic.w.o from the list of targets [puppet] - 10https://gerrit.wikimedia.org/r/1229087 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:24:32] (03PS4) 10Blake: sre.switchdc.mediawiki: Automate scap lock/unlock [cookbooks] - 10https://gerrit.wikimedia.org/r/1229076 
(https://phabricator.wikimedia.org/T330996) [11:24:36] (03CR) 10Blake: "Please let me know if this seems like a reasonable approach - I'm not very clear yet on how the multi-stage cookbooks are executed." [cookbooks] - 10https://gerrit.wikimedia.org/r/1229076 (https://phabricator.wikimedia.org/T330996) (owner: 10Blake) [11:24:44] (03CR) 10Brouberol: [C:03+2] trafficserver: remove mpic.w.o from the list of targets [puppet] - 10https://gerrit.wikimedia.org/r/1229087 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:24:49] (03CR) 10Brouberol: [C:03+2] Define the test-kitchen namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228530 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:24:52] (03CR) 10Brouberol: [C:03+2] Define the test-kitchen service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228531 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:25:19] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1228580 (owner: 10Alexandros Kosiaris) [11:26:43] (03Abandoned) 10Brouberol: global_config: add external-services for all eventgate LVS endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1229072 (https://phabricator.wikimedia.org/T411989) (owner: 10Brouberol) [11:29:32] (03Merged) 10jenkins-bot: Define the test-kitchen namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228530 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:30:00] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06ServiceOps new, 07Datacenter-Switchover: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11536627 (10Blake) a:03Blake [11:30:16] (03Merged) 10jenkins-bot: Define the test-kitchen service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228531 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:30:39] !log brouberol@deploy2002 helmfile 
[dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:30:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:33:22] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen: apply [11:33:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen: apply [11:35:13] !log Running `foreachwikiindblist medium.dblist extensions/CheckUser/maintenance/populateUserAgentTable.php --batch-size 1000 -sleep 1` for T413868 [11:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:18] T413868: Populate the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T413868 [11:36:17] FIRING: [2x] SLOMetricAbsent: xlab-combined-latency-success-v1 - https://slo.wikimedia.org/?search=xlab-combined-latency-success-v1 - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:36:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen: apply [11:36:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen: apply [11:37:50] (03PS2) 10Federico Ceratto: admin: add pham to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1228481 (https://phabricator.wikimedia.org/T414660) [11:39:24] PROBLEM - Host dse-k8s-worker1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:41:10] RECOVERY - Host dse-k8s-worker1001 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [11:41:12] !log Running `/usr/local/bin/foreachwikiindblist "all.dblist - mediamoderation-continuous-scan.dblist - preinstall.dblist" extensions/MediaModeration/maintenance/scanFilesInScanTable.php --use-jobqueue --sleep=1 --poll-sleep=10 --verbose` [11:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:00] (03PS1) 10Brouberol: dse-k8s-eqiad: remove mpic helmfiles [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1229096 (https://phabricator.wikimedia.org/T407808) [11:42:02] (03PS1) 10Brouberol: Delete the mpic chart, that was replaced by test-kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229097 (https://phabricator.wikimedia.org/T407808) [11:42:05] (03PS1) 10Brouberol: dse-k8s-eqiad: remove mpic namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229098 (https://phabricator.wikimedia.org/T407808) [11:42:52] !log btullis@cumin1003 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:dse-k8s-worker-eqiad [11:44:40] !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{dse-k8s-worker[1002-1019].eqiad.wmnet} and (A:dse-k8s-master-eqiad or A:dse-k8s-worker-eqiad) [11:44:54] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11536863 (10ops-monitoring-bot) Roll-reboot of nodes in dse-eqiad cluster started by btullis: * dse-k8... 
[11:46:39] (03PS1) 10Brouberol: deployment_server: remove the mpic kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1229099 (https://phabricator.wikimedia.org/T407808) [11:46:41] (03PS1) 10Brouberol: cache: remove caching rules related to the mpic domains [puppet] - 10https://gerrit.wikimedia.org/r/1229100 (https://phabricator.wikimedia.org/T407808) [11:46:44] (03PS1) 10Brouberol: mariadb: replace mpic database names by their test_kitchen counterpart [puppet] - 10https://gerrit.wikimedia.org/r/1229101 (https://phabricator.wikimedia.org/T407808) [11:46:47] (03PS1) 10Brouberol: service: drop the mpic services from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1229102 (https://phabricator.wikimedia.org/T407808) [11:46:50] (03PS1) 10Brouberol: service_proxy: rename the mpic listener to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1229103 (https://phabricator.wikimedia.org/T407808) [11:48:22] (03PS1) 10Brouberol: data_plaform/slos: rename mpic to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1229104 (https://phabricator.wikimedia.org/T407808) [11:51:06] (03CR) 10Santiago Faci: [C:03+1] dse-k8s-eqiad: remove mpic helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229096 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:51:23] (03CR) 10Santiago Faci: [C:03+1] Delete the mpic chart, that was replaced by test-kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229097 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:51:41] (03CR) 10Santiago Faci: [C:03+1] dse-k8s-eqiad: remove mpic namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229098 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:51:42] PROBLEM - Host dse-k8s-worker1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:52:15] (03CR) 10Santiago Faci: [C:03+1] deployment_server: remove the mpic kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1229099 
(https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:52:33] (03CR) 10Santiago Faci: [C:03+1] cache: remove caching rules related to the mpic domains [puppet] - 10https://gerrit.wikimedia.org/r/1229100 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:52:52] (03CR) 10Santiago Faci: [C:03+1] mariadb: replace mpic database names by their test_kitchen counterpart [puppet] - 10https://gerrit.wikimedia.org/r/1229101 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:53:10] RECOVERY - Host dse-k8s-worker1002 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [11:53:40] (03CR) 10Santiago Faci: [C:03+1] service: drop the mpic services from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1229102 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:54:03] (03CR) 10Santiago Faci: [C:03+1] service_proxy: rename the mpic listener to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1229103 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:54:12] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad: remove mpic helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229096 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:54:15] (03CR) 10Brouberol: [C:03+2] Delete the mpic chart, that was replaced by test-kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229097 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:54:18] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad: remove mpic namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229098 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:55:06] (03CR) 10Santiago Faci: [C:03+1] data_plaform/slos: rename mpic to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1229104 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:56:13] (03CR) 10Brouberol: [C:03+2] deployment_server: remove the mpic kubeconfigs [puppet] - 
10https://gerrit.wikimedia.org/r/1229099 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:56:15] (03CR) 10Brouberol: [C:03+2] cache: remove caching rules related to the mpic domains [puppet] - 10https://gerrit.wikimedia.org/r/1229100 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:56:17] (03CR) 10Brouberol: [C:03+2] mariadb: replace mpic database names by their test_kitchen counterpart [puppet] - 10https://gerrit.wikimedia.org/r/1229101 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:56:19] (03CR) 10Brouberol: [C:03+2] service: drop the mpic services from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1229102 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:56:22] (03CR) 10Brouberol: [C:03+2] service_proxy: rename the mpic listener to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1229103 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:56:22] (03Merged) 10jenkins-bot: dse-k8s-eqiad: remove mpic helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229096 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:56:25] (03CR) 10Brouberol: [C:03+2] data_plaform/slos: rename mpic to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1229104 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [11:56:28] (03Merged) 10jenkins-bot: Delete the mpic chart, that was replaced by test-kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229097 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [12:00:11] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1228481 (https://phabricator.wikimedia.org/T414660) (owner: 10Federico Ceratto) [12:01:15] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1228580 (owner: 10Alexandros Kosiaris) [12:01:49] (03Merged) 10jenkins-bot: dse-k8s-eqiad: remove mpic namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229098 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [12:04:15] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1228581 (owner: 10Alexandros Kosiaris) [12:05:59] (03PS1) 10Slyngshede: Docker build [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1229106 (https://phabricator.wikimedia.org/T412826) [12:06:42] (03CR) 10Gkyziridis: [C:03+1] "LGTM! Thnx for deploying!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229086 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [12:11:11] (03PS1) 10Muehlenhoff: Remove puppetmaster::updatenetboot [puppet] - 10https://gerrit.wikimedia.org/r/1229107 (https://phabricator.wikimedia.org/T365798) [12:12:16] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1228583 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [12:12:40] (03PS2) 10Slyngshede: Docker build [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1229106 (https://phabricator.wikimedia.org/T412826) [12:13:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:14:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[12:15:30] (03PS2) 10Muehlenhoff: Remove puppetmaster::updatenetboot [puppet] - 10https://gerrit.wikimedia.org/r/1229107 (https://phabricator.wikimedia.org/T365798) [12:15:58] (03PS1) 10Federico Ceratto: service, trafficserver: Prepare "linked-artifacts" k8s pod [puppet] - 10https://gerrit.wikimedia.org/r/1227851 (https://phabricator.wikimedia.org/T414112) [12:17:56] (03PS1) 10Federico Ceratto: admin: add pham to deployment [puppet] - 10https://gerrit.wikimedia.org/r/1229105 (https://phabricator.wikimedia.org/T414671) [12:18:02] (03CR) 10Federico Ceratto: [C:03+2] admin: add pham to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1228481 (https://phabricator.wikimedia.org/T414660) (owner: 10Federico Ceratto) [12:19:18] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1229105 (https://phabricator.wikimedia.org/T414671) (owner: 10Federico Ceratto) [12:19:34] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1229107 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:28:48] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update revise-tone-task-generator image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229086 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [12:30:34] (03Merged) 10jenkins-bot: ml-services: Update revise-tone-task-generator image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229086 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [12:30:35] (03PS1) 10Muehlenhoff: Remove Puppet 5 volatile directory from backups [puppet] - 10https://gerrit.wikimedia.org/r/1229110 (https://phabricator.wikimedia.org/T365798) [12:32:24] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . 
[12:33:39] (03CR) 10Muehlenhoff: [C:03+2] Simplify partman config for maps [puppet] - 10https://gerrit.wikimedia.org/r/1229067 (owner: 10Muehlenhoff) [12:34:36] !log bwojtowicz@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [12:36:59] !log bwojtowicz@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [12:49:44] !log installing git security updates [12:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:59] (03CR) 10Clément Goubert: service, trafficserver: Prepare "linked-artifacts" k8s pod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1227851 (https://phabricator.wikimedia.org/T414112) (owner: 10Federico Ceratto) [12:54:12] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T1300) [13:02:53] jouncebot: nowandnext [13:02:54] For the next 0 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T1300) [13:02:54] In 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T1400) [13:03:04] I'm gonna lock scap for a minute, I need to test something [13:03:42] !log cgoubert@deploy2002 Locking from deployment [ALL REPOSITORIES]: Testing scap bg lock [13:04:04] !log cgoubert@deploy2002 Forcefully removing global lock: Testing scap bg lock [13:04:16] !log cgoubert@deploy2002 Forcefully removing global lock: Testing scap bg lock [13:04:23] 
!log cgoubert@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Testing scap bg lock (duration: 00m 40s) [13:06:35] !log cgoubert@cumin1003 START - Cookbook sre.switchdc.mediawiki.00-lock-scap for datacenter switchover from codfw to eqiad [13:06:37] !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.switchdc.mediawiki.00-lock-scap (exit_code=99) for datacenter switchover from codfw to eqiad [13:06:56] That's expected [13:06:58] All done [13:09:33] Actually one more test [13:09:58] !log root@deploy2002 Locking from deployment [ALL REPOSITORIES]: Testing scap bg lock [13:10:12] !log root@deploy2002 Forcefully removing global lock: Testing scap bg lock [13:10:30] !log root@deploy2002 Forcefully removing global lock: Testing scap bg lock [13:10:44] !log cgoubert@deploy2002 Forcefully removing global lock: Testing scap bg lock [13:10:46] !log root@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Testing scap bg lock (duration: 00m 47s) [13:15:12] (03CR) 10Clément Goubert: [C:04-1] "Two small improvements inline, but while testing I discovered a couple of issues that I'd like @adancy@wikimedia.org to weigh in on." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1229076 (https://phabricator.wikimedia.org/T330996) (owner: 10Blake) [13:17:55] !log cgoubert@deploy2002 Locking from deployment [ALL REPOSITORIES]: Testing scap bg lock [13:18:05] !log cgoubert@deploy2002 Forcefully removing global lock: Testing scap bg lock [13:18:14] !log cgoubert@deploy2002 Forcefully removing global lock: Testing scap bg lock [13:18:16] !log cgoubert@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Testing scap bg lock (duration: 00m 20s) [13:18:16] (03Abandoned) 10Brouberol: mpic: delete services from service list [puppet] - 10https://gerrit.wikimedia.org/r/1212436 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [13:18:21] (03Abandoned) 10Brouberol: mpic: delete kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1212434 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [13:19:13] FIRING: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:24:02] 06SRE, 06Release-Engineering-Team, 10Scap, 06ServiceOps new, and 3 others: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996#11537269 (10Clement_Goubert) [13:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:28:30] (03CR) 10Federico Ceratto: [C:03+2] admin: add pham to deployment [puppet] - 10https://gerrit.wikimedia.org/r/1229105 (https://phabricator.wikimedia.org/T414671) (owner: 10Federico Ceratto) [13:30:02] 06SRE, 10SRE-Access-Requests: Requesting access to L3 data access for kimpham 
(developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11537281 (10FCeratto-WMF) 05In progress→03Resolved a:03FCeratto-WMF Change deployed, closing task. [13:30:19] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to SRE/production access for Kim.pham (kimpham in phab) - https://phabricator.wikimedia.org/T414671#11537284 (10FCeratto-WMF) 05Open→03Resolved a:03FCeratto-WMF Change deployed, closing task. [13:31:44] (03CR) 10BCornwall: [C:03+2] ncmonitor: Ignore game show domains [puppet] - 10https://gerrit.wikimedia.org/r/1228593 (owner: 10BCornwall) [13:32:52] (03Abandoned) 10BCornwall: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1228603 (owner: 10Ncmonitor) [13:32:59] (03Abandoned) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1228604 (owner: 10Ncmonitor) [13:33:02] (03Abandoned) 10BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1228605 (owner: 10Ncmonitor) [13:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:34:40] (03CR) 10Federico Ceratto: "Adding Eric in Cc" [puppet] - 10https://gerrit.wikimedia.org/r/1227851 (https://phabricator.wikimedia.org/T414112) (owner: 10Federico Ceratto) [13:38:58] FIRING: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:40:26] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1204 - https://phabricator.wikimedia.org/T414861#11537364 (10Jclark-ctr) a:03Jclark-ctr [13:45:26] 06SRE, 10observability, 
10Prod-Kubernetes, 06ServiceOps new: write some recording rules for queries used in the appserver RED k8s dashboard - https://phabricator.wikimedia.org/T249663#11537378 (10hnowlan) o11y will look into adding recording rules for this dashboard. [13:45:44] 06SRE, 10Prod-Kubernetes, 06ServiceOps new, 06SRE Observability: write some recording rules for queries used in the appserver RED k8s dashboard - https://phabricator.wikimedia.org/T249663#11537384 (10hnowlan) [13:48:23] (03CR) 10Pmiazga: [C:03+1] rest gateway: include a meaningful body with 429 responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226827 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [13:51:55] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1204 - https://phabricator.wikimedia.org/T414861#11537424 (10Jclark-ctr) You have successfully submitted request SR221528423. [13:52:13] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Network is hard down on an-worker1160.eqiad.wmnet - https://phabricator.wikimedia.org/T414942#11537425 (10VRiley-WMF) Currently looking at this device to see what could be causing the issue. Noted, there is no LED activity on the NIC. Attempted to reseated... 
[13:52:53] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11537439 (10Jclark-ctr) a:03Jclark-ctr [13:54:44] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11537452 (10Jclark-ctr) [13:54:53] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1204 - https://phabricator.wikimedia.org/T414861#11537453 (10Jclark-ctr) [13:55:01] (03PS1) 10Daniel Kinzler: rest gateway: include service values.yaml when testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229119 [13:55:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1204 - https://phabricator.wikimedia.org/T414861#11537454 (10Jclark-ctr) [13:56:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11537455 (10Jclark-ctr) [13:59:32] (03PS14) 10Daniel Kinzler: rest gateway: add tests for chart rendering [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225085 [13:59:48] (03PS5) 10Daniel Kinzler: rest gateway: implement per-policy shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225699 (https://phabricator.wikimedia.org/T413183) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. [14:01:27] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T415001#11537493 (10Jclark-ctr) a:03Jclark-ctr This server is out of warranty can it be swapped with a disk from a decom server? if thats that works and i have on... 
[14:02:28] (03CR) 10Jcrespo: [C:03+1] Remove Puppet 5 volatile directory from backups [puppet] - 10https://gerrit.wikimedia.org/r/1229110 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [14:03:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11537507 (10Jclark-ctr) You have successfully submitted request SR221528999. [14:05:56] (03CR) 10Alexandros Kosiaris: [C:03+2] "PCC is happy, no breaks with the change, merging. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1228580 (owner: 10Alexandros Kosiaris) [14:06:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11537510 (10Jclark-ctr) a:05Jclark-ctr→03herron [14:06:59] (03CR) 10Blake: sre.switchdc.mediawiki: Automate scap lock/unlock (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1229076 (https://phabricator.wikimedia.org/T330996) (owner: 10Blake) [14:07:00] (03CR) 10Alexandros Kosiaris: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1228581 (owner: 10Alexandros Kosiaris) [14:16:32] !log btullis@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{dse-k8s-worker[1002-1019].eqiad.wmnet} and (A:dse-k8s-master-eqiad or A:dse-k8s-worker-eqiad) [14:18:31] (03CR) 10Eevans: service, trafficserver: Prepare "linked-artifacts" k8s pod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1227851 (https://phabricator.wikimedia.org/T414112) (owner: 10Federico Ceratto) [14:21:44] !log switching off Blazegraph on wdqs2009 (legacy full graph endpoint is end of life) - T411410 [14:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:48] T411410: Decommission WDQS full graph endpoint (wdqs2009) - https://phabricator.wikimedia.org/T411410 [14:25:40] (03CR) 10Elukey: Docker build (035 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1229106 (https://phabricator.wikimedia.org/T412826) (owner: 10Slyngshede) [14:27:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:27:36] !log blake@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-main-eqiad [14:30:22] (03CR) 10Vgutierrez: service, trafficserver: Prepare "linked-artifacts" k8s pod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1227851 (https://phabricator.wikimedia.org/T414112) (owner: 10Federico Ceratto) [14:30:55] (03CR) 10Vgutierrez: [C:04-1] service, trafficserver: Prepare "linked-artifacts" k8s pod [puppet] - 10https://gerrit.wikimedia.org/r/1227851 (https://phabricator.wikimedia.org/T414112) (owner: 10Federico Ceratto) [14:32:25] RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:34:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:37:19] (03Abandoned) 10Ladsgroup: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188330 (owner: 10PipelineBot) [14:37:49] (03Abandoned) 10Ladsgroup: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189506 (owner: 10PipelineBot) [14:38:00] (03Abandoned) 10Ladsgroup: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191374 (owner: 10PipelineBot) [14:38:19] (03Abandoned) 10Ladsgroup: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196689 (owner: 10PipelineBot) [14:42:57] (03CR) 10Eevans: [C:03+2] cassandra: Drop departed staff db [puppet] - 10https://gerrit.wikimedia.org/r/1228300 (owner: 10Ladsgroup) [14:44:02] RECOVERY - Host an-worker1160 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [14:45:44] jouncebot: nowandnext [14:45:44] For the next 0 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T1400) [14:45:44] In 0 hour(s) and 14 minute(s): Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T1500) [14:46:16] (03CR) 10Zabe: [C:03+2] Start writing to il_target_id everywhere except commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1228506 (https://phabricator.wikimedia.org/T413526) (owner: 10Zabe) [14:47:09] (03Merged) 10jenkins-bot: Start writing to il_target_id everywhere except commons 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1228506 (https://phabricator.wikimedia.org/T413526) (owner: 10Zabe) [14:47:47] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1228506|Start writing to il_target_id everywhere except commons (T413526)]] [14:47:52] T413526: Set imagelinks migration to write both - https://phabricator.wikimedia.org/T413526 [14:48:47] 06SRE, 07sre-alert-triage, 10Maps: Alert in need of triage: OsmSynchronisationLag (instance maps-test2001:9100) - https://phabricator.wikimedia.org/T399158#11537604 (10JMeybohm) Removing #serviceops since we won't be working on this. [14:49:56] !log zabe@deploy2002 zabe: Backport for [[gerrit:1228506|Start writing to il_target_id everywhere except commons (T413526)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:49:59] (03CR) 10Eevans: [C:03+2] "For posterity sake: This also requires removal of the role (or grants). I've done that for the aqs cluster." [puppet] - 10https://gerrit.wikimedia.org/r/1228300 (owner: 10Ladsgroup) [14:50:07] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Network is hard down on an-worker1160.eqiad.wmnet - https://phabricator.wikimedia.org/T414942#11537611 (10Jclark-ctr) Replaced Cable link came up ` jclark@lsw1-e6-eqiad> show interfaces xe-0/0/20 terse Interface Admin Link Proto Local... 
[14:50:18] !log zabe@deploy2002 zabe: Continuing with sync [14:50:26] (03PS1) 10BCornwall: ncmonitor: Ignore wikipediathegame.{com,org} [puppet] - 10https://gerrit.wikimedia.org/r/1229125 [14:54:26] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1228506|Start writing to il_target_id everywhere except commons (T413526)]] (duration: 06m 39s) [14:54:32] T413526: Set imagelinks migration to write both - https://phabricator.wikimedia.org/T413526 [14:55:24] (03PS1) 10Zabe: Start writing il_target_id on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229126 (https://phabricator.wikimedia.org/T413526) [14:57:07] 06SRE: please update astein puppet ssh key - https://phabricator.wikimedia.org/T414830#11537635 (10AStein-WMF) [14:59:25] (03PS1) 10Superpes15: [itwiki] Change tagline for Wikipedia25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229130 (https://phabricator.wikimedia.org/T414320) [14:59:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Network is hard down on an-worker1160.eqiad.wmnet - https://phabricator.wikimedia.org/T414942#11537654 (10brouberol) I'm now able to ssh onto the host. Thank you for checking! 
[14:59:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Network is hard down on an-worker1160.eqiad.wmnet - https://phabricator.wikimedia.org/T414942#11537659 (10brouberol) 05Open→03Resolved [14:59:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Network is hard down on an-worker1160.eqiad.wmnet - https://phabricator.wikimedia.org/T414942#11537658 (10brouberol) [15:00:04] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T1500) [15:00:38] !log running `migrateLinksTable.php --table imagelinks` on s5 wikis # T413668 [15:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:42] T413668: Run the data migration of imagelinks - https://phabricator.wikimedia.org/T413668 [15:01:10] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to NDA for Johannnes89 - https://phabricator.wikimedia.org/T414789#11537664 (10FCeratto-WMF) @FRomeo_WMF and @greg hello - could you please review the request for approval? [15:01:18] (03PS1) 10Jakob: Stop logging batch start [dumps] - 10https://gerrit.wikimedia.org/r/1229127 (https://phabricator.wikimedia.org/T408423) [15:09:12] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:19] (03CR) 10Ssingh: [C:03+1] ncmonitor: Ignore wikipediathegame.{com,org} [puppet] - 10https://gerrit.wikimedia.org/r/1229125 (owner: 10BCornwall) [15:09:27] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T415001#11537695 (10Jclark-ctr) @jcrespo i have a disk available can it be swapped any time? 
[15:10:16] (03CR) 10BCornwall: [C:03+2] ncmonitor: Ignore wikipediathegame.{com,org} [puppet] - 10https://gerrit.wikimedia.org/r/1229125 (owner: 10BCornwall) [15:10:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Degraded RAID on an-worker1200 - https://phabricator.wikimedia.org/T413360#11537702 (10VRiley-WMF) Hey @BTullis Is this something we can close, or would you like me to replace the drive? Just to make sure that... [15:15:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:15:18] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T415001#11537736 (10Jclark-ctr) I noticed the drive was already :removed: in IDRAC So i did swap drive just now @jcrespo [15:16:03] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11537743 (10ssingh) Yeah, unless we update our own zone files but more specifically, Markmonitor, nothing really changes so we can just go ahead with the approach @cmooney suggested and enable it... [15:17:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T415001#11537746 (10Jclark-ctr) Virtual drive is rebuilding Will monitor. 
and close when finished RAID Information Progress 1% [15:17:25] !log blake@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-main-eqiad [15:18:18] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface lswtest-d8-eqiad:mgmt0 () - https://phabricator.wikimedia.org/T414939#11537752 (10VRiley-WMF) a:03VRiley-WMF [15:18:58] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface lswtest-d8-eqiad:mgmt0 () - https://phabricator.wikimedia.org/T414939#11537754 (10VRiley-WMF) →14Duplicate dup:03T412733 [15:19:04] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11537756 (10VRiley-WMF) [15:19:12] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:19:40] !log blake@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-main-codfw [15:20:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:22:11] (03CR) 10Clément Goubert: [C:04-1] sre.switchdc.mediawiki: Automate scap lock/unlock (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1229076 (https://phabricator.wikimedia.org/T330996) (owner: 10Blake) [15:23:09] (03PS1) 10Superpes15: [hawiki] Add a temporary wordmark and tagline for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229132 (https://phabricator.wikimedia.org/T414736) [15:23:56] (03CR) 
10CI reject: [V:04-1] [hawiki] Add a temporary wordmark and tagline for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229132 (https://phabricator.wikimedia.org/T414736) (owner: 10Superpes15) [15:25:42] (03CR) 10Vgutierrez: [C:03+2] cache::upload: enable global ratelimiting (ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/1228568 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [15:27:24] (03CR) 10Ssingh: [C:03+1] "Thank you!" [dns] - 10https://gerrit.wikimedia.org/r/1228996 (https://phabricator.wikimedia.org/T411617) (owner: 10Ayounsi) [15:27:42] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Updated measurement of request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T415080 (10MatthewVernon) 03NEW [15:28:12] (03CR) 10Ssingh: [C:03+2] wikimedia/wikipedia.org: match TTLs for NS and glue records [dns] - 10https://gerrit.wikimedia.org/r/1226904 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [15:28:32] !log sukhe@dns1004 START - running authdns-update [15:28:50] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Updated measurement of request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T415080#11537841 (10MatthewVernon) 05Open→03Resolved p:05Triage→03High The query used (the same as last time, modulo da... [15:29:07] !log sukhe@dns1004 START - running authdns-update [15:29:33] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 13Patch-For-Review: Requesting `analytics-admins` access for AKhatun - https://phabricator.wikimedia.org/T414846#11537846 (10Ahoelzl) Approved. 
[15:29:38] !log ran authdns-update for CR 1226904 to match zone file TTLs with registrar for wikimedia.org/wikipedia.org NS/glue records
[15:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T1530)
[15:30:14] !log sukhe@dns1004 END - running authdns-update
[15:32:15] (CR) Ssingh: [V:+1] "[-2 was self-inflicted to prevent accidental merge.]" [puppet] - https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: Ssingh)
[15:32:34] !log bking@deploy2002 renewing TLS certificates on opensearch-ipoid codfw (AKA deleting pods one-by-one)
[15:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:12] (PS4) Superpes15: [hawiki] Add a temporary wordmark and tagline for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1229132 (https://phabricator.wikimedia.org/T414736)
[15:34:12] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:35:09] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:35:36] (PS2) Superpes15: [itwiki] Change tagline for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1229130 (https://phabricator.wikimedia.org/T414320)
[15:38:57] !log sudo cumin "A:cp" "disable-puppet 'merging CR 1059423'": T117618
[15:38:58] RESOLVED: CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:03] T117618: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618
[15:39:29] (CR) Btullis: [C:+2] Add akhatun to the analytics-admin group [puppet] - https://gerrit.wikimedia.org/r/1228429 (https://phabricator.wikimedia.org/T414846) (owner: Btullis)
[15:40:35] (CR) Ssingh: [C:+2] varnish: Add restrictive CSP to upload.wikimedia.org for testwiki only [puppet] - https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: CDobbins)
[15:41:31] !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{dse-k8s-worker[1006-1019].eqiad.wmnet} and (A:dse-k8s-master-eqiad or A:dse-k8s-worker-eqiad)
[15:41:47] SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2026.01.05 - 2026.01.23), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11537913 (ops-monitoring-bot) Roll-reboot of nodes in dse-eqiad cluster started by btullis: * dse-k8...
[15:43:32] !log btullis@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{dse-k8s-worker[1006-1019].eqiad.wmnet} and (A:dse-k8s-master-eqiad or A:dse-k8s-worker-eqiad)
[15:44:16] SRE, SRE-Access-Requests: Requesting access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11537922 (Novem_Linguae) I don't see krb: present in the patch, so looks like this was done as level 2 instead of the requested level 3. Is this correct?
[15:47:27] !log enable puppet on cp110[15].eqiad.wmnet: T117618
[15:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:33] T117618: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618
[15:49:23] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: Updated measurement of request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T415080#11537953 (Ladsgroup) I've collected and analyzed requests to non-standard thumbnail sizes: 2/3rd are medium browser s...
[15:50:21] (CR) Elukey: "@jmeybohm@wikimedia.org hellooo do you think this is good now? Should be really easy to test in staging, to see if it works or not. Even i" [deployment-charts] - https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) (owner: Elukey)
[15:50:39] !log sudo cumin -b31 "A:cp" "run-puppet-agent --enable 'merging CR 1059423'": T117618
[15:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:57] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1229139
[16:00:05] jelto, arnoldokoth, mutante, and arnaudb: Time to snap out of that daydream and deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T1600).
[16:02:44] SRE-swift-storage, Ceph, ServiceOps new, Epic, and 2 others: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11538047 (dancy) Thanks for the report @elukey. This sounds very promising!
[16:07:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87788 and previous config saved to /var/cache/conftool/dbconfig/20260120-160659-marostegui.json
[16:07:07] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[16:07:07] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[16:09:50] (PS1) Giuseppe Lavagetto: haproxy: temporary fix for bot-passwords [puppet] - https://gerrit.wikimedia.org/r/1229141
[16:10:43] !log blake@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-main-codfw
[16:12:21] (PS1) Elukey: docker_registry: simplify and improve the /v2/ comment [puppet] - https://gerrit.wikimedia.org/r/1229143
[16:13:09] (CR) Vgutierrez: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7917/co" [puppet] - https://gerrit.wikimedia.org/r/1229141 (owner: Giuseppe Lavagetto)
[16:14:15] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Eqiad: Replacement top-of-rack switch for rack C1 - https://phabricator.wikimedia.org/T403031#11538082 (cmooney)
[16:16:31] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1229144
[16:16:41] (CR) JHathaway: [C:+1] Remove puppetmaster::updatenetboot [puppet] - https://gerrit.wikimedia.org/r/1229107 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[16:17:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P87789 and previous config saved to /var/cache/conftool/dbconfig/20260120-161707-marostegui.json
[16:18:18] (PS1) Elukey: DNM: docker_registry: move /v2/restricted to the s3 restricted backend [puppet] - https://gerrit.wikimedia.org/r/1229145 (https://phabricator.wikimedia.org/T412951)
[16:21:08] ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11538121 (Jhancock.wm) @cmooney these have had os installs from nokia switches so far. E2: wikikube-worker2334 wikikube-worker2335...
[16:22:55] (CR) Giuseppe Lavagetto: [C:+2] haproxy: temporary fix for bot-passwords [puppet] - https://gerrit.wikimedia.org/r/1229141 (owner: Giuseppe Lavagetto)
[16:25:04] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant Access to NDA for Johannnes89 - https://phabricator.wikimedia.org/T414789#11538136 (greg) Thanks for the ping @FCeratto-WMF . I approve this request.
[16:27:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P87790 and previous config saved to /var/cache/conftool/dbconfig/20260120-162715-marostegui.json
[16:28:48] (CR) JMeybohm: [C:+1] "Sorry, fell through the cracks after I ran recheck." [deployment-charts] - https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) (owner: Elukey)
[16:30:46] (PS1) Scott French: shellbox: Pick up newly rebuilt images [deployment-charts] - https://gerrit.wikimedia.org/r/1229133
[16:32:14] (CR) Kamila Součková: [C:+1] shellbox: Pick up newly rebuilt images [deployment-charts] - https://gerrit.wikimedia.org/r/1229133 (owner: Scott French)
[16:32:34] ops-eqsin, SRE, DC-Ops, Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11538196 (RobH) p:Medium→High They've connected a crash cart and the host is hard down. Seems we have a bad mainboard or a bad PSU controller. I'm typing up directions for Jin@ DreamIIC fo...
[16:32:57] FIRING: CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[16:37:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87791 and previous config saved to /var/cache/conftool/dbconfig/20260120-163724-marostegui.json
[16:37:32] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[16:37:32] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[16:37:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[16:37:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1232 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87792 and previous config saved to /var/cache/conftool/dbconfig/20260120-163748-marostegui.json
[16:37:57] RESOLVED: CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[16:44:22] jouncebot: nowandnext
[16:44:22] For the next 0 hour(s) and 15 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T1600)
[16:44:23] In 0 hour(s) and 15 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T1700)
[16:46:01] (CR) Scott French: [C:+2] shellbox: Pick up newly rebuilt images [deployment-charts] - https://gerrit.wikimedia.org/r/1229133 (owner: Scott French)
[16:47:33] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T410589)', diff saved to https://phabricator.wikimedia.org/P87793 and previous config saved to /var/cache/conftool/dbconfig/20260120-164731-ladsgroup.json
[16:47:40] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[16:48:16] (Merged) jenkins-bot: shellbox: Pick up newly rebuilt images [deployment-charts] - https://gerrit.wikimedia.org/r/1229133 (owner: Scott French)
[16:50:30] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply
[16:50:58] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[16:51:29] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[16:51:43] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[16:52:14] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[16:52:31] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[16:53:02] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[16:53:18] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[16:53:47] (PS1) Herron: mwlog: update partman config [puppet] - https://gerrit.wikimedia.org/r/1229152 (https://phabricator.wikimedia.org/T412230)
[16:53:49] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply
[16:54:11] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply
[16:54:12] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:54:42] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[16:55:06] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[16:55:06] SRE-swift-storage, Ceph, ServiceOps new, Epic, and 3 others: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11538348 (MatthewVernon) One further note - cluster-wide metrics on sync delay (as opposed to the headers...
[16:57:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P87794 and previous config saved to /var/cache/conftool/dbconfig/20260120-165740-ladsgroup.json
[16:58:05] !log dancy@deploy2002 Installing scap version "4.233.0" for 2 host(s)
[16:59:56] !log dancy@deploy2002 Installation of scap version "4.233.0" completed for 2 hosts
[17:00:05] jhathaway and rzl: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260120T1700).
[17:00:05] No Gerrit patches in the queue for this window AFAICS.
[17:00:37] (CR) Herron: [C:+2] mwlog: update partman config [puppet] - https://gerrit.wikimedia.org/r/1229152 (https://phabricator.wikimedia.org/T412230) (owner: Herron)
[17:02:09] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[17:02:43] ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11538373 (cmooney) >>! In T408757#11538119, @Jhancock.wm wrote: > @cmooney these have had os installs from nokia switches so far. Th...
[17:02:47] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[17:03:19] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[17:03:53] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[17:04:12] RESOLVED: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:04:25] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[17:04:43] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[17:05:15] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[17:05:35] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[17:05:54] RECOVERY - MegaRAID on db1171 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:06:07] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[17:06:18] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Eqiad: Replacement top-of-rack switch for rack C1 - https://phabricator.wikimedia.org/T403031#11538382 (RobH) Update from meeting: * This switch need was noted by Willy on the Nokia A/B T412711 order and is being ordered there. * This task can...
[17:06:31] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Eqiad: Replacement top-of-rack switch for rack C1 - https://phabricator.wikimedia.org/T403031#11538386 (RobH)
[17:06:31] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[17:07:03] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[17:07:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P87795 and previous config saved to /var/cache/conftool/dbconfig/20260120-170748-ladsgroup.json
[17:08:07] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[17:08:53] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Eqiad: Replacement top-of-rack switch for rack C1 - https://phabricator.wikimedia.org/T403031#11538401 (RobH) a:VRiley-WMF @VRiley-WMF or @Jclark-ctr: With the switch being ordered on T412711, the only thing possibly pending order from the t...
[17:10:48] (CR) Thcipriani: [C:+1] zuul: write TLS passphrase to a file for zookeeper [puppet] - https://gerrit.wikimedia.org/r/1227735 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[17:11:03] (PS8) Cathal Mooney: team-netops: add rule for packet drops in higher-priority queues [alerts] - https://gerrit.wikimedia.org/r/1219852 (https://phabricator.wikimedia.org/T384052)
[17:13:06] (CR) Cathal Mooney: [C:+2] team-netops: add rule for packet drops in higher-priority queues [alerts] - https://gerrit.wikimedia.org/r/1219852 (https://phabricator.wikimedia.org/T384052) (owner: Cathal Mooney)
[17:14:20] (Merged) jenkins-bot: team-netops: add rule for packet drops in higher-priority queues [alerts] - https://gerrit.wikimedia.org/r/1219852 (https://phabricator.wikimedia.org/T384052) (owner: Cathal Mooney)
[17:17:58] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T410589)', diff saved to https://phabricator.wikimedia.org/P87796 and previous config saved to /var/cache/conftool/dbconfig/20260120-171757-ladsgroup.json
[17:18:04] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[17:18:13] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2189.codfw.wmnet with reason: Maintenance
[17:18:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2189 (T410589)', diff saved to https://phabricator.wikimedia.org/P87797 and previous config saved to /var/cache/conftool/dbconfig/20260120-171821-ladsgroup.json
[17:21:13] (PS1) FNegri: cloudnfs: Add wikiqlever project to dumps mounts [puppet] - https://gerrit.wikimedia.org/r/1229159 (https://phabricator.wikimedia.org/T414986)
[17:21:17] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply
[17:21:21] !log rotated statograph API key
[17:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:49] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[17:22:21] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[17:22:47] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[17:22:54] (CR) Vgutierrez: prometheus: add depooled cp* host check (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: CDobbins)
[17:23:18] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[17:23:32] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[17:24:04] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[17:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[17:24:22] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[17:24:53] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply
[17:25:15] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[17:25:47] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[17:26:43] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply