[00:14:16] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [00:15:16] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [00:17:03] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10877968 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host thanos-be2008.codfw.wmnet with OS bullseye... [00:17:43] jclark@cumin1002 reimage (PID 3145923) is awaiting input [00:23:35] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [00:23:54] 10ops-codfw, 06SRE, 06DC-Ops: mc-misc2001 won't power up - https://phabricator.wikimedia.org/T395526#10877975 (10Jhancock.wm) opened a support ticket with supermicro #WNA-525-54109 [00:32:42] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 4 CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [00:44:14] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/cd1158b2ddeab87c0bcc6c117fc7b816a5379dffd78a8e8650e4959066c50801/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [00:45:15] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage [00:47:55] !log 
jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage [00:53:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10878007 (10Jhancock.wm) @MatthewVernon hey having a little issue with this one. I did manage to get thanos-be2009 to partition and install the os,... [01:01:57] (03CR) 10Jdlrobson: [C:03+1] Deploy to en at twenty percent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152860 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia) [01:04:14] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:08:56] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.4 [core] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1152863 (https://phabricator.wikimedia.org/T392174) [01:08:59] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.4 [core] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1152863 (https://phabricator.wikimedia.org/T392174) (owner: 10TrainBranchBot) [01:09:46] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate mobileapps.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:13:26] (03Abandoned) 10Jdlrobson: Fixes: TypeError: Cannot read properties of undefined (reading 'contains') [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148938 (owner: 10Jdlrobson) [01:21:42] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.4 [core] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1152863 (https://phabricator.wikimedia.org/T392174) (owner: 
10TrainBranchBot) [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T0200) [02:04:52] RECOVERY - dump of x3 in codfw on backupmon1001 is OK: Last dump for x3 at codfw (db2200) taken on 2025-06-03 00:44:30 (266 GiB, +0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:27:06] (03CR) 10Andrew Bogott: [C:03+2] eqiad1: install Octavia lbaas [puppet] - 10https://gerrit.wikimedia.org/r/1152833 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [02:41:43] (03PS1) 10Andrew Bogott: Add haproxy frontend for octavia in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1152877 (https://phabricator.wikimedia.org/T393783) [02:41:45] (03PS1) 10Andrew Bogott: Add octavia mgmt cidr to cloudgw.yaml network defs [puppet] - 10https://gerrit.wikimedia.org/r/1152878 (https://phabricator.wikimedia.org/T393783) [02:43:15] (03CR) 10Andrew Bogott: [C:03+2] Add haproxy frontend for octavia in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1152877 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [02:44:05] (03CR) 10Andrew Bogott: [C:03+2] Add octavia mgmt cidr to cloudgw.yaml network defs [puppet] - 10https://gerrit.wikimedia.org/r/1152878 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [02:48:43] (03PS1) 10Andrew Bogott: Octavia/haproxy: fix copy/paste errors [puppet] - 10https://gerrit.wikimedia.org/r/1152879 [02:50:10] (03CR) 10Andrew Bogott: [C:03+2] Octavia/haproxy: fix copy/paste errors [puppet] - 10https://gerrit.wikimedia.org/r/1152879 (owner: 10Andrew Bogott) [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T0300) [03:01:53] (03PS1) 10TrainBranchBot: 
testwikis to 1.45.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152881 (https://phabricator.wikimedia.org/T392174) [03:01:54] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.45.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152881 (https://phabricator.wikimedia.org/T392174) (owner: 10TrainBranchBot) [03:02:53] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152881 (https://phabricator.wikimedia.org/T392174) (owner: 10TrainBranchBot) [03:02:57] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.4 refs T392174 [03:03:04] T392174: 1.45.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T392174 [03:48:52] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs T392174 (duration: 45m 55s) [03:48:56] T392174: 1.45.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T392174 [03:52:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T0400) [04:01:42] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) [04:04:52] RECOVERY - dump of x3 in eqiad on backupmon1001 is OK: Last dump for x3 at eqiad (db1216) taken on 2025-06-03 00:55:54 (266 GiB, +0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:23:35] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - 
https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [04:29:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance [04:31:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2040 T395771', diff saved to https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json [04:31:54] T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771 [04:33:34] (03PS1) 10Marostegui: mariadb: Productionize es2048 [puppet] - 10https://gerrit.wikimedia.org/r/1152884 (https://phabricator.wikimedia.org/T395771) [04:35:00] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es2048 [puppet] - 10https://gerrit.wikimedia.org/r/1152884 (https://phabricator.wikimedia.org/T395771) (owner: 10Marostegui) [04:40:04] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet [04:42:05] (03PS1) 10Marostegui: instances.yaml: Add es2048 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1152886 (https://phabricator.wikimedia.org/T395771) [04:44:01] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add es2048 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1152886 (https://phabricator.wikimedia.org/T395771) (owner: 10Marostegui) [04:45:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es2048 to dbctl depooled T395771+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json [04:46:00] T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771 [04:48:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es2035 with weight 0 T395420', diff saved to 
https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json [04:48:58] T395420: Switchover es6 master (es2037 -> es2035) - https://phabricator.wikimedia.org/T395420 [04:49:03] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 T395420 [04:51:36] !log Starting es6 codfw failover from es2037 to es2035 - T395420 [04:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:52:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write T395420', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json [04:52:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2037 T395420', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json [04:54:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 T395420 [04:54:09] T395420: Switchover es6 master (es2037 -> es2035) - https://phabricator.wikimedia.org/T395420 [04:54:33] (03CR) 10Marostegui: [C:03+2] mariadb: Promote es2035 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1151610 (https://phabricator.wikimedia.org/T395420) (owner: 10Gerrit maintenance bot) [04:55:14] (03PS3) 10Giuseppe Lavagetto: cp2027: remove experimental connection-rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1152081 [04:55:14] (03PS4) 10Giuseppe Lavagetto: cache::haproxy: remove unused if stanzas [puppet] - 10https://gerrit.wikimedia.org/r/1152028 [04:55:14] (03PS5) 10Giuseppe Lavagetto: cache::haproxy: start optimizing for readability [puppet] - 10https://gerrit.wikimedia.org/r/1152029 [04:55:14] (03PS3) 10Giuseppe Lavagetto: 
cache::haproxy: remove generic ring definition [puppet] - 10https://gerrit.wikimedia.org/r/1152082 [04:55:15] (03PS5) 10Giuseppe Lavagetto: cache::haproxy: remove unused variables from configuration [puppet] - 10https://gerrit.wikimedia.org/r/1152083 [04:55:17] (03PS4) 10Giuseppe Lavagetto: cache::haproxy: remove post_acl_actions and sticktable variables [puppet] - 10https://gerrit.wikimedia.org/r/1152288 [04:55:21] (03PS1) 10Giuseppe Lavagetto: cache::haproxy: fully set x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1152887 (https://phabricator.wikimedia.org/T392217) [04:59:27] FIRING: [2x] ProbeDown: Service logstash1025:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1025:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:59:59] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es1038 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1152890 (https://phabricator.wikimedia.org/T395867) [05:00:03] (03PS1) 10Gerrit maintenance bot: wmnet: Update es6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1152891 (https://phabricator.wikimedia.org/T395867) [05:03:35] RESOLVED: [2x] ProbeDown: Service logstash1025:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1025:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:07:12] RECOVERY - Disk space on restbase2030 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2030&var-datasource=codfw+prometheus/ops [05:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:11] !log marostegui@deploy1003 Started scap sync-world: Backport for [[gerrit:1152893|db-production.php: Disable writes on es6 (T395867)]] [05:09:14] T395867: Switchover es6 master (es1037 -> es1038) - https://phabricator.wikimedia.org/T395867 [05:09:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 T395867 [05:09:46] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate mobileapps.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:10:20] (03PS1) 10Marostegui: db-production.php: Disable writes on es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152893 (https://phabricator.wikimedia.org/T395867) [05:11:40] (03CR) 10Marostegui: [C:03+2] db-production.php: Disable writes on es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152893 (https://phabricator.wikimedia.org/T395867) (owner: 10Marostegui) [05:12:41] (03Merged) 10jenkins-bot: db-production.php: Disable writes on es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152893 (https://phabricator.wikimedia.org/T395867) (owner: 10Marostegui) [05:13:14] !log marostegui@deploy1003 marostegui: Backport for [[gerrit:1152893|db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[05:13:17] (03PS1) 10Giuseppe Lavagetto: haproxy: remove conditionals on wikimedia_trust [puppet] - 10https://gerrit.wikimedia.org/r/1152894 [05:14:02] !log marostegui@deploy1003 marostegui: Continuing with sync [05:15:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 T395867 [05:15:23] T395867: Switchover es6 master (es1037 -> es1038) - https://phabricator.wikimedia.org/T395867 [05:16:32] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 140 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:22:51] !log marostegui@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152893|db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) [05:22:54] T395867: Switchover es6 master (es1037 -> es1038) - https://phabricator.wikimedia.org/T395867 [05:23:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es1038 with weight 0 T395867', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json [05:25:53] !log Starting es6 eqiad failover from es1037 to es1038 - T395867 [05:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1038 to es6 primary T395867', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json [05:26:37] !log marostegui@dns1006 START - running authdns-update [05:26:47] (03CR) 10Marostegui: [C:03+2] mariadb: Promote es1038 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1152890 (https://phabricator.wikimedia.org/T395867) (owner: 10Gerrit maintenance bot) [05:27:18] !log marostegui@dns1006 END - running authdns-update [05:27:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1037 T395867', diff saved 
to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json [05:29:15] !log marostegui@deploy1003 Started scap sync-world: Backport for [[gerrit:1152896|Revert "db-production.php: Disable writes on es6"]] [05:30:17] (03CR) 10Marostegui: [C:03+2] wmnet: Update es6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1152891 (https://phabricator.wikimedia.org/T395867) (owner: 10Gerrit maintenance bot) [05:30:38] (03PS1) 10Marostegui: Revert "db-production.php: Disable writes on es6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152896 [05:31:22] !log marostegui@deploy1003 marostegui: Backport for [[gerrit:1152896|Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [05:31:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json [05:32:04] (03CR) 10Marostegui: [C:03+2] Revert "db-production.php: Disable writes on es6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152896 (owner: 10Marostegui) [05:32:04] !log marostegui@deploy1003 marostegui: Continuing with sync [05:33:08] (03PS1) 10Marostegui: es1037: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1152897 (https://phabricator.wikimedia.org/T394469) [05:33:17] (03Merged) 10jenkins-bot: Revert "db-production.php: Disable writes on es6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152896 (owner: 10Marostegui) [05:35:54] (03CR) 10Marostegui: [C:03+2] es1037: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1152897 (https://phabricator.wikimedia.org/T394469) (owner: 10Marostegui) [05:39:08] !log marostegui@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152896|Revert "db-production.php: 
Disable writes on es6"]] (duration: 09m 52s) [05:41:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json [05:45:20] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [05:46:18] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [05:46:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json [05:47:52] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:49:39] FIRING: CoreBGPDown: Core BGP session down between cr3-ulsfo and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Confed_eqord&var-bgp_neighbor=cr2-eqord - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:52:52] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:54:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:56:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to /var/cache/conftool/dbconfig/20250603-055628-root.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T0600) [06:00:05] marostegui, Amir1, and federico3: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T0600). nyaa~ [06:01:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json [06:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:07:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [06:07:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T395241)', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json [06:07:52] RESOLVED: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) 
{#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:09:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:11:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json [06:14:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T395241)', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json [06:16:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json [06:26:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json [06:28:55] PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:29:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 T390954 [06:29:38] T390954: Set up x3 replication to wikireplicas - https://phabricator.wikimedia.org/T390954 [06:29:38] (03PS1) 10Marostegui: clouddb1016: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1153018 (https://phabricator.wikimedia.org/T390954) [06:30:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to /var/cache/conftool/dbconfig/20250603-063004-fceratto.json [06:30:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 T390954 [06:30:47] RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 783, active_shards: 1856, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [06:30:47] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:31:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json [06:37:18] !log Decrease buffer size on clouddb1016:s8 T390954 [06:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:22] T390954: Set up x3 replication to 
wikireplicas - https://phabricator.wikimedia.org/T390954 [06:41:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json [06:45:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json [06:45:19] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org [06:45:21] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [06:46:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json [06:48:17] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye [06:48:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10878341 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host thanos-be2009.codfw.wmnet with OS bullse... 
[06:51:03] jmm@cumin1003 makevm (PID 161956) is awaiting input [06:51:38] (03PS1) 10Ayounsi: Prometheus: gnmic_target_up rewrite name to instance [puppet] - 10https://gerrit.wikimedia.org/r/1153026 (https://phabricator.wikimedia.org/T388641) [06:53:58] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" [06:54:02] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" [06:54:02] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:54:02] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors [06:54:06] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors [06:54:29] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" [06:54:33] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" [06:57:34] jmm@cumin1003 makevm (PID 161956) is awaiting input [06:57:57] PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:58:47] RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 783, active_shards: 1856, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [06:58:47] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:58:51] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm [06:59:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10878368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host bast7002.wikimedia.org with OS bookworm [06:59:10] (03CR) 10Muehlenhoff: [C:03+2] CAS: Add service definition for Zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/1152680 (https://phabricator.wikimedia.org/T395304) (owner: 10Muehlenhoff) [07:00:04] Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T0700). [07:00:04] Tchanders: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T395241)', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json [07:00:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [07:00:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T395241)', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json [07:01:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json [07:02:24] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [07:02:35] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [07:06:31] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [07:06:41] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[07:10:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T395241)', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json [07:12:15] I'll deploy my patch [07:14:45] !log tchanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1142649|Assign IP auto-reveal rights to certain groups (T386492)]] [07:14:48] T386492: IP auto-reveal: Assign the IP auto-reveal right to user groups - https://phabricator.wikimedia.org/T386492 [07:16:49] !log tchanders@deploy1003 tchanders: Backport for [[gerrit:1142649|Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:18:25] !log tchanders@deploy1003 tchanders: Continuing with sync [07:21:37] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10878399 (10MatthewVernon) @Jhancock.wm thanks for this; I tried running puppet on thanos-be2009 and the problem was that /dev/sdi1 had an EFI parti... [07:23:40] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage [07:23:52] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye [07:24:09] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10878404 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host thanos-be2009.codfw.wmnet with OS bul... 
[07:25:24] !log tchanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1142649|Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) [07:25:27] T386492: IP auto-reveal: Assign the IP auto-reveal right to user groups - https://phabricator.wikimedia.org/T386492 [07:26:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json [07:27:36] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage [07:30:00] Tchanders: That's all the scheduled patches for the window, right? [07:37:55] !log jmm@cumin1003 START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master [07:38:20] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors [07:38:24] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors [07:41:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json [07:43:58] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm [07:43:58] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org [07:44:10] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10878428 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host bast7002.wikimedia.org with OS bookworm completed: - bast7002 (**WARN**) - Remov... 
[07:44:42] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors [07:44:45] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors [07:46:02] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage [07:46:34] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet [07:46:40] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [07:49:04] !log jmm@cumin1003 END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master [07:49:22] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage [07:49:57] OK. SpiderPig says that all deployments are finished. I'm going to enable the SDS 2.4.11 Synthetic A/A Test experiment in xLab for 5 minutes. 
I'll be monitoring _all the logs_ throughout [07:52:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [07:52:18] jmm@cumin1003 makevm (PID 169791) is awaiting input [07:56:19] !log jmm@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [07:56:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T395241)', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json [07:56:27] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet [07:56:31] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [07:56:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T395241)', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json [07:57:06] !log Enabling the SDS 2.4.11 Synthetic A/A Test in xLab [07:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:02] !log Disabling the SDS 2.4.11 Synthetic A/A Test in xLab [08:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T395241)', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json [08:06:53] !log mvernon@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" [08:09:59] mvernon@cumin1002 reimage
(PID 3277510) is awaiting input [08:10:16] !log mvernon@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" [08:10:16] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye [08:10:26] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10878454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host thanos-be2009.codfw.wmnet with OS bullsey... [08:10:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet [08:14:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet [08:14:42] (03CR) 10Ilias Sarantopoulos: [C:03+1] ores-extension: enable extension with revertrisk filter for second batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [08:16:37] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 143 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:16:47] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye [08:17:04] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10878484 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host thanos-be2008.codfw.wmnet with OS bul... 
[08:17:12] (03CR) 10Vgutierrez: "looking good, I've added a few comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/1152887 (https://phabricator.wikimedia.org/T392217) (owner: 10Giuseppe Lavagetto) [08:19:28] FIRING: KeyholderUnarmed: 2 unarmed Keyholder key(s) on cumin1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [08:21:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json [08:22:37] !log rearm keyholder on cumin1003 following reboot [08:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:35] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:24:28] RESOLVED: KeyholderUnarmed: 2 unarmed Keyholder key(s) on cumin1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [08:30:06] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm [08:34:28] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm [08:34:34] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10878555 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host prometheus7002.magru.wmnet with OS bookworm [08:36:07] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 
'main' . [08:36:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json [08:37:52] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [08:38:51] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage [08:40:05] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [08:41:41] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [08:42:30] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage [08:43:24] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [08:45:08] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[08:51:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T395241)', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json [08:51:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [08:51:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T395241)', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json [08:52:24] (03CR) 10Giuseppe Lavagetto: [C:04-1] "See a couple of code comments; but also: I'm not sure this will work at all with some of the tests we run that require read-write access t" [puppet] - 10https://gerrit.wikimedia.org/r/1135115 (owner: 10JHathaway) [08:52:52] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/3 (Core: ssw1-f1-codfw:et-0/0/31 {#changeme_cwdm2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:53:27] (03PS1) 10Majavah: hieradata: Update Striker to 2025-06-02-141244-production [puppet] - 10https://gerrit.wikimedia.org/r/1153099 [08:54:10] (03PS1) 10Muehlenhoff: Assign bastion role to bast7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153101 (https://phabricator.wikimedia.org/T394263) [08:54:11] (03CR) 10Volans: "couple of optional suggestions inline" [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [08:54:38] (03PS2) 10Muehlenhoff: Assign bastion role to bast7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153101 (https://phabricator.wikimedia.org/T394263) [08:57:45] (03CR) 10Giuseppe 
Lavagetto: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1152029 (owner: 10Giuseppe Lavagetto) [08:59:16] !log mvernon@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" [09:00:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T395241)', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json [09:02:22] mvernon@cumin1002 reimage (PID 3332792) is awaiting input [09:06:22] !log mvernon@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" [09:06:23] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye [09:06:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10878658 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host thanos-be2008.codfw.wmnet with OS bullsey... [09:09:38] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10878673 (10MatthewVernon) @Jhancock.wm @Jclark-ctr I've now re-imaged thanos-be2008 and thanos-be2009 OK. The problem in both cases was that there... [09:09:47] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate mobileapps.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:10:05] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[09:10:20] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:10:25] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10878678 (10MoritzMuehlenhoff) [09:15:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json [09:16:01] 06SRE, 06Infrastructure-Foundations, 10netops: Homer: stop using the 'section' macro in jinja templates - https://phabricator.wikimedia.org/T395555#10878693 (10Volans) No objection from my side. I can look if there are other alternative options in addition to those mentioned, but I'm not sure there is any. [09:16:05] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:16:16] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:17:25] 06SRE, 06Traffic: Move ncredir7003 into service and decom ncredir7002 - https://phabricator.wikimedia.org/T395796#10878694 (10MoritzMuehlenhoff) [09:18:40] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:18:44] !log marostegui@cumin1002 START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning [09:18:51] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[09:20:06] jelto@cumin1002 reimage (PID 3346180) is awaiting input [09:20:10] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5753/co" [puppet] - 10https://gerrit.wikimedia.org/r/1152029 (owner: 10Giuseppe Lavagetto) [09:20:12] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10878709 (10MoritzMuehlenhoff) [09:21:27] (03PS3) 10Volans: DataPers. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136843 [09:21:52] marostegui@cumin1002 clone (PID 3188453) is awaiting input [09:22:01] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/output/1152029/5753/cp4050.ulsfo.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1152029 (owner: 10Giuseppe Lavagetto) [09:22:15] !log puppet cert destroy {mobileapps,proton,recommendation-api}.discovery.wmnet on puppetmaster1001 - old certs not used anymore [09:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:46] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage [09:23:00] 06SRE, 07SRE-Unowned, 10Maps: New apus account for Tegola - https://phabricator.wikimedia.org/T395659#10878749 (10MatthewVernon) I'm not inclined to move this right now, I think - thanos-swift already gives you S3 protocol support and cross-DC replication, and a very large number of small objects is not a su... 
[09:24:09] (03CR) 10Giuseppe Lavagetto: [C:03+2] cp2027: remove experimental connection-rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1152081 (owner: 10Giuseppe Lavagetto) [09:25:21] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:25:22] (03CR) 10Giuseppe Lavagetto: [C:03+2] cache::haproxy: remove unused if stanzas [puppet] - 10https://gerrit.wikimedia.org/r/1152028 (owner: 10Giuseppe Lavagetto) [09:25:32] (03PS1) 10Cathal Mooney: Cloudsw: rename sw_ibgp policy to ibgp_out [homer/public] - 10https://gerrit.wikimedia.org/r/1153103 (https://phabricator.wikimedia.org/T394530) [09:25:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage [09:27:17] (03CR) 10Cathal Mooney: [C:03+2] Cloudsw: rename sw_ibgp policy to ibgp_out [homer/public] - 10https://gerrit.wikimedia.org/r/1153103 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [09:27:24] (03CR) 10Ayounsi: [C:03+2] Remove magru RIPE Atlas Anchor [puppet] - 10https://gerrit.wikimedia.org/r/1153102 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:28:36] (03Merged) 10jenkins-bot: Cloudsw: rename sw_ibgp policy to ibgp_out [homer/public] - 10https://gerrit.wikimedia.org/r/1153103 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [09:29:33] 06SRE, 07SRE-Unowned, 10Maps: New apus account for Tegola - https://phabricator.wikimedia.org/T395659#10878760 (10elukey) 05Open→03Resolved a:03elukey @MatthewVernon it is not pressing, we can definitely revisit later on, but it will probably be months away. If it is not a problem for the Thanos ->... 
[09:29:46] RESOLVED: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate mobileapps.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:30:27] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:30:32] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10878768 (10elukey) >>! In T381565#10870440, @elukey wrote: > > Since this is not complex enough, we may add a little extra complexity and also migrate the S3 configuration for Tegol... [09:31:32] (03CR) 10Vgutierrez: [C:03+2] varnish: Don't let wmfuniq_experiment_fetcher crash if endpoint is unavailable (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [09:32:27] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:32:39] (03CR) 10Ayounsi: [C:03+1] Grant firewall access for bast7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153104 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:36:08] (03CR) 10Volans: [C:03+2] DataPers. 
cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136843 (owner: 10Volans) [09:40:09] PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp7001 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:43:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm [09:44:25] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [09:46:25] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [09:47:25] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [09:48:28] (03PS1) 10Vgutierrez: varnish: Fix wmfuniq_experiment_fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1153106 (https://phabricator.wikimedia.org/T391411) [09:59:49] !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm [10:00:00] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10878847 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host prometheus7002.magru.wmnet with OS bookworm executed with er... 
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T1000) [10:00:09] RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp7001 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:01:30] (03PS3) 10Vgutierrez: varnish: Provide basic logging and metrics for wmfuniq_experiment_fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1152754 (https://phabricator.wikimedia.org/T391411) [10:07:50] (03CR) 10Muehlenhoff: [C:03+2] Grant firewall access for bast7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153104 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [10:12:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:12:58] <_joe_> effie: are you doing anything with memcached? 
^^ [10:13:52] I am doing nothing with memcached, I am looking [10:13:55] !log drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 T387504 [10:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:58] T387504: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504 [10:14:21] <_joe_> effie: it's already over, but seemed like a rather serious issue given the numbers [10:14:27] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [10:15:25] (03CR) 10Giuseppe Lavagetto: [C:03+2] cache::haproxy: start optimizing for readability [puppet] - 10https://gerrit.wikimedia.org/r/1152029 (owner: 10Giuseppe Lavagetto) [10:15:27] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [10:16:22] (03CR) 10Giuseppe Lavagetto: [C:03+2] cache::haproxy: remove generic ring definition [puppet] - 10https://gerrit.wikimedia.org/r/1152082 (owner: 10Giuseppe Lavagetto) [10:17:05] (03PS5) 10Giuseppe Lavagetto: cache::haproxy: remove generic ring definition [puppet] - 10https://gerrit.wikimedia.org/r/1152082 [10:17:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:18:15] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views [10:20:13] _joe_: we think it is the ferm reapply issue, we are working on it with moritz [10:20:33] <_joe_> ack [10:20:39] FIRING: CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status -
https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=Confed_eqord&var-bgp_neighbor=cr2-eqord - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:22:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:23:35] effie, _joe_: I was planning on re-enabling an experiment. This would mean a slight increase in traffic reaching the app servers. Can I proceed, given the memcached issue? [10:23:57] effie, _joe_: indeed, it's that. shortly before the alert went off I merged https://gerrit.wikimedia.org/r/1153104 which modifies the SSH firewall config fleet-wide [10:24:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:25:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json [10:25:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json [10:25:38] ^ It's firing again. 
I'll hold off :) [10:25:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:27:10] phuedx: this is being propagated as time passes, I would say give it another 15' and go for it [10:27:26] moritzm: aye, I checked the puppet logs [10:27:35] effie: ACK [10:28:30] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10878920 (10Ladsgroup) [10:28:35] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:28:35] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:28:45] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:29:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:30:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:32:08] (03CR) 10CI reject: [V:04-1] Add alerting for ALIS availability & latency monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1153116 (https://phabricator.wikimedia.org/T386116) 
(owner: 10Cyndywikime) [10:35:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:36:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:38:14] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured [10:38:20] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10878954 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4361b0dd-e1a8-43e5-ae62-03d241fc927c) set by cmooney@cumin1002 for 0:30:00 on 11 host(s) and their services with r... [10:38:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10878956 (10MoritzMuehlenhoff) For the reimages to succeed, these need to be re-provisioned with EFI, we adapted the install procedure while debugging some... 
[10:40:11] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet [10:40:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T395241)', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json [10:40:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [10:40:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T395241)', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json [10:41:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:43:16] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet [10:43:29] (03PS15) 10Effie Mouzeli: validating-admission-policies: add policy to permit hostPath mounts for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) [10:43:29] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [10:43:44] !log fnegri@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [10:44:04] (03CR) 10Effie Mouzeli: validating-admission-policies: add policy to permit hostPath mounts for mediawiki (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) (owner: 10Effie Mouzeli) [10:44:33] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet [10:44:35] !log fnegri@cumin1002 START - 
Cookbook sre.wikireplicas.update-views [10:46:09] effie: I make that 15. I've seen the memcached error fluttering. Should I still hold off? [10:46:33] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [10:47:27] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [10:48:03] FIRING: KubernetesAPILatency: High Kubernetes API latency (PUT leases) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:48:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T395241)', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to /var/cache/conftool/dbconfig/20250603-104809-fceratto.json [10:48:27] phuedx: hang on, because we seem to have another problem [10:49:27] phuedx: I would say go ahead, the memcached errors seem to be going back to normal [10:49:37] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet [10:49:54] (03CR) 10Hnowlan: [C:03+1] mw::maint::recount_categories: foreachwiki_ignore_errors [puppet] - 10https://gerrit.wikimedia.org/r/1153108 (https://phabricator.wikimedia.org/T395745) (owner: 10Clément Goubert) [10:50:39] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-a8-codfw and cr2-codfw (10.192.254.6) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=ssw1-a8-codfw:9804&var-bgp_group=core&var-bgp_neighbor=cr2-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:51:19] FIRING: [4x] 
CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-b1-codfw and cr2-codfw (10.192.254.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [10:51:30] ^^ these are due to work on cr2-codfw [10:52:00] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet [10:53:03] RESOLVED: [5x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:54:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d8-codfw:et-0/0/31 (Core: cr2-codfw:et-0/0/2 {#122403}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d8-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:55:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr3-ulsfo and cr2-codfw (208.80.153.193) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:57:00] effie: ACK [10:57:16] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet [10:57:16] !log Enabling the SDS 2.4.11 Synthetic A/A Test in xLab [10:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:20] (03PS2) 10Vgutierrez: liberica: Don't set forwarding_cores/numa_node for katran [puppet] - 10https://gerrit.wikimedia.org/r/1151644 (https://phabricator.wikimedia.org/T395228) [10:59:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-a8-codfw:et-0/0/31 (Core: cr2-codfw:et-0/1/2 {#230403800040}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:00:39] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-codfw (208.80.153.193) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:03:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json [11:04:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-a8-codfw:et-0/0/31 (Core: cr2-codfw:et-0/1/2 {#230403800040}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:05:39] RESOLVED: [7x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-codfw (208.80.153.193) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:06:19] RESOLVED: [4x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-b1-codfw and cr2-codfw (10.192.254.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [11:07:06] (03PS3) 10Cyndywikime: Add alerting for ALIS availability [alerts] - 10https://gerrit.wikimedia.org/r/1153116 (https://phabricator.wikimedia.org/T386116) [11:08:36] RESOLVED: [10x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/0/1:1 (Core: cr2-codfw:xe-0/0/1:2 {#10695_12273-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:09:24] !log fnegri@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [11:11:50] jouncebot: nowandnext [11:11:50] No deployments 
scheduled for the next 0 hour(s) and 48 minute(s) [11:11:50] In 0 hour(s) and 48 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T1200) [11:14:24] (03PS11) 10Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) [11:15:17] (03CR) 10Hnowlan: [C:03+1] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1153027 (owner: 10Muehlenhoff) [11:17:29] (03CR) 10Clément Goubert: [C:03+2] mw::maint::recount_categories: foreachwiki_ignore_errors [puppet] - 10https://gerrit.wikimedia.org/r/1153108 (https://phabricator.wikimedia.org/T395745) (owner: 10Clément Goubert) [11:18:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json [11:21:17] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 T395785 [11:21:19] T395785: Switchover es7 master (es2038 -> es2039) - https://phabricator.wikimedia.org/T395785 [11:22:26] (03CR) 10Majavah: [C:03+2] conftool-data: Add x3 wiki replica backend services [puppet] - 10https://gerrit.wikimedia.org/r/1149603 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [11:22:35] !log taavi@cumin1002 conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 [11:22:36] (03PS1) 10Muehlenhoff: Also add replica label for the new upcoming prometheus7002 node [puppet] - 10https://gerrit.wikimedia.org/r/1153126 (https://phabricator.wikimedia.org/T394263) [11:23:14] (03CR) 10Majavah: [C:03+2] P:wmcs::cloudlb: Add x3 wiki replica backend service [puppet] - 10https://gerrit.wikimedia.org/r/1149604 
(https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [11:25:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning [11:25:46] jouncebot: next [11:25:46] In 0 hour(s) and 34 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T1200) [11:28:12] (03PS1) 10Marostegui: es1039: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1153129 (https://phabricator.wikimedia.org/T395647) [11:29:12] !log marostegui@deploy1003 Started scap sync-world: Backport for [[gerrit:1153130|db-production.php: Disable writes on es7 (T395647)]] [11:29:14] T395647: Migrate es7 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395647 [11:29:36] (03CR) 10Effie Mouzeli: kubernetes:mediawiki_runner: introduce mw-experimental (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [11:29:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 T395785 [11:29:57] T395785: Switchover es7 master (es2038 -> es2039) - https://phabricator.wikimedia.org/T395785 [11:30:13] (03PS1) 10Marostegui: db-production.php: Disable writes on es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153130 (https://phabricator.wikimedia.org/T395647) [11:31:23] !log marostegui@deploy1003 marostegui: Backport for [[gerrit:1153130|db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:32:02] !log marostegui@deploy1003 marostegui: Continuing with sync [11:33:28] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[11:36:37] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet [11:36:59] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10879217 (10MatthewVernon) Looks like you've just finished the codfw 3x ones, so I looked: wikipedia-commons-local-thumb.30 838,472 objects 96,830,126,986 bytes wikipedia-commons-loca... [11:38:35] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:38:35] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:38:45] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:39:08] !log marostegui@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153130|db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) [11:39:11] T395647: Migrate es7 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395647 [11:39:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T395241)', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json [11:39:45] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [11:39:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1039 T395647', diff saved to 
https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json [11:39:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T395241)', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json [11:40:29] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10879222 (10cmooney) >>! In T387504#10871698, @cmooney wrote: > @Jhancock.wm as discussed on irc the link from ssw1-e1-codfw is working fine, however the link from ssw1-f1-codfw to cr2-codfw... [11:40:35] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [11:41:46] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet [11:43:35] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:43:35] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:43:45] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:44:33] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [11:46:05] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . 
[11:46:27] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [11:46:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 T395785 [11:46:30] T395785: Switchover es7 master (es2038 -> es2039) - https://phabricator.wikimedia.org/T395785 [11:46:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es2039 with weight 0 T395785', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json [11:48:01] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10879247 (10Ladsgroup) The reason I didn't ping you is that when I got to ms-fe, all screens were terminated which might mean it was cut (and rebooted?) halfway through the deletion,... [11:48:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T395241)', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json [11:48:48] !log Starting es7 codfw failover from es2038 to es2039 - T395785 [11:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write T395785', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json [11:49:34] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
[11:50:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2038 T395785', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json [11:52:04] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [11:52:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [11:54:11] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [11:55:13] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . [11:56:00] (03PS4) 10Clément Goubert: mw::periodic_job: Move foreachwiki_ignore_errors [puppet] - 10https://gerrit.wikimedia.org/r/1153128 (https://phabricator.wikimedia.org/T395745) [11:56:53] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [11:59:57] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T1200) [12:00:54] (03CR) 10Ayounsi: [C:03+1] Update magru bastion for ssh-client-config and tunnelencabulator [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1153110 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [12:00:56] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . 
[12:02:48] jouncebot: next [12:02:49] In 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T1300) [12:02:50] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . [12:03:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json [12:03:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json [12:04:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json [12:04:40] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
[12:05:12] (03CR) 10Marostegui: [C:03+2] Revert "db-production.php: Disable writes on es7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153135 (owner: 10Marostegui) [12:05:59] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [12:06:23] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [12:06:56] !log marostegui@deploy1003 Started scap sync-world: Backport for [[gerrit:1153135|Revert "db-production.php: Disable writes on es7"]] [12:07:27] !log Launching manual run of recount-categories cronjob - T395745 [12:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:29] T395745: MediaWiki periodic job recount-categories failed - https://phabricator.wikimedia.org/T395745 [12:08:18] (03Merged) 10jenkins-bot: Revert "db-production.php: Disable writes on es7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153135 (owner: 10Marostegui) [12:08:34] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . [12:09:02] !log marostegui@deploy1003 marostegui: Backport for [[gerrit:1153135|Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:09:45] !log marostegui@deploy1003 marostegui: Continuing with sync [12:10:40] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [12:10:54] (03PS21) 10Effie Mouzeli: kubernetes:mediawiki_runner: introduce mw-experimental #5 [puppet] - 10https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) [12:12:46] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[12:13:38] (03PS10) 10Effie Mouzeli: hieradata: Make wikikube-worker2100 a mw-experimental worker [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) [12:15:22] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [12:16:13] (03PS4) 10Effie Mouzeli: mw-experimental: create new service #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) [12:16:43] !log marostegui@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153135|Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) [12:18:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to /var/cache/conftool/dbconfig/20250603-121824-fceratto.json [12:19:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json [12:19:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json [12:23:35] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [12:25:16] !log marostegui@cumin1002 START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning [12:30:48] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye 
[12:30:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10879344 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmnet with OS bullseye [12:31:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10879348 (10Jclark-ctr) [12:32:14] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Bump changelog for 1.0.2 release [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1153138 (owner: 10Muehlenhoff) [12:33:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T395241)', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json [12:33:51] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [12:33:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T395241)', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json [12:34:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json [12:34:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json [12:35:11] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet [12:40:14] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
ganeti2049.codfw.wmnet [12:42:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T395241)', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json [12:45:39] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-f1-codfw and cr2-codfw (10.192.253.174) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=ssw1-f1-codfw:9804&var-bgp_group=core&var-bgp_neighbor=cr2-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:49:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json [12:49:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json [12:50:15] (03PS1) 10Cathal Mooney: Typo in bgp peer definition on cr2-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1153141 (https://phabricator.wikimedia.org/T394021) [12:56:56] (03CR) 10FNegri: [C:03+1] hieradata: cloudlb: Move x3 VIP to new x3 backend [puppet] - 10https://gerrit.wikimedia.org/r/1149605 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [12:57:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json [12:58:50] (03CR) 10Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature #4 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 
(https://phabricator.wikimedia.org/T395284) (owner: 10Effie Mouzeli) [12:58:58] !log uploaded wmf-laptop 1.0.2 to apt.wikimedia.org [12:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:30] o/ [13:00:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between ssw1-f1-codfw and cr2-codfw (10.192.253.174) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=ssw1-f1-codfw:9804&var-bgp_group=core&var-bgp_neighbor=cr2-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:00:43] nothing in the calendar at the moment [13:02:45] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 T390954 [13:02:49] T390954: Set up x3 replication to wikireplicas - https://phabricator.wikimedia.org/T390954 [13:04:14] !log Shutdown clouddb1016:x3 T390954 [13:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json [13:04:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json [13:08:30] (03CR) 10Cathal Mooney: [C:03+1] gNMI: add target down alerting [alerts] 
- 10https://gerrit.wikimedia.org/r/1153030 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:08:46] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:09:21] (03CR) 10Cathal Mooney: [C:03+1] Prometheus: gnmic_target_up rewrite name to instance [puppet] - 10https://gerrit.wikimedia.org/r/1153026 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:11:00] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:11:43] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:12:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json [13:12:31] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:12:44] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage [13:12:49] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:14:54] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet [13:15:08] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[13:16:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage [13:16:28] !log installing libavif security updates [13:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:05] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:18:43] (03PS1) 10Marostegui: clouddb1020.yaml: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1153145 (https://phabricator.wikimedia.org/T390954) [13:18:50] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:18:56] (03PS3) 10Ssingh: sre.cdn.roll-restart-ats: add cookbook for restarting ATS [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 [13:19:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json [13:20:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json [13:20:09] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:20:10] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet [13:21:31] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:22:58] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[13:25:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:26:00] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:27:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T395241)', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json [13:27:42] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:27:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [13:28:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T395241)', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json [13:30:48] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428#10879473 (10Milimetric) @Arnoldokoth: yes, approved! Thanks for the ping. [13:30:53] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:31:05] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[13:32:22] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Disk (sde) failed in moss-be1002 - https://phabricator.wikimedia.org/T395103#10879477 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Replaced Failed Drive [13:32:22] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:32:46] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:33:53] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [13:34:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json [13:34:41] (03CR) 10Marostegui: [C:04-2] "The objectstash host already exists, that's the issue. We can maybe just rename the prompt and the files themselves?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1152673 (owner: 10Marostegui) [13:34:52] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [13:37:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T395241)', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json [13:38:01] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [13:38:22] PROBLEM - mysqld processes on clouddb1020 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:39:47] (03PS1) 10Muehlenhoff: Add library hint for libavif [puppet] - 10https://gerrit.wikimedia.org/r/1153148 [13:40:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye [13:40:53] RESOLVED: [2x] KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:40:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10879490 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmnet with OS bullseye com... 
[13:41:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10879491 (10Jclark-ctr) [13:41:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10879492 (10Jclark-ctr) 05Open→03Resolved [13:42:31] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [13:43:32] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:44:46] !log Disabled the SDS 2.4.11 Synthetic A/A Test in xLab [13:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:22] RECOVERY - mysqld processes on clouddb1020 is OK: PROCS OK: 3 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:45:52] (03CR) 10CI reject: [V:04-1] New function to generate device-specific IBGP data from cluster YAML [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1151793 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [13:46:32] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:49:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json [13:51:37] (03PS4) 10Ayounsi: New function to generate device-specific IBGP data from cluster YAML [software/homer/deploy] - 
10https://gerrit.wikimedia.org/r/1151793 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [13:52:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json [13:52:37] (03CR) 10Tiziano Fogli: [C:03+1] site.pp: Fix entry for prometheus7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153153 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:53:28] PROBLEM - MariaDB Replica Lag: s8 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 544.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:54:06] (03CR) 10Muehlenhoff: [C:03+2] site.pp: Fix entry for prometheus7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153153 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:56:07] (03PS1) 10Andrew Bogott: Octavia config updates [puppet] - 10https://gerrit.wikimedia.org/r/1153156 (https://phabricator.wikimedia.org/T393783) [13:57:46] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm [13:57:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10879559 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host prometheus7002.magru.wmnet with OS bookworm [13:58:59] (03CR) 10Volans: sre.cdn.roll-restart-ats: add cookbook for restarting ATS (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 (owner: 10Ssingh) [13:59:53] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Disk (sde) failed in moss-be1002 - https://phabricator.wikimedia.org/T395103#10879562 (10MatthewVernon) Thanks :) [14:01:09] !log dropping term store tables from s8 (T351802) [14:01:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:12] T351802: Wikibase: Introduce separate database configuration for term store - https://phabricator.wikimedia.org/T351802 [14:01:33] !log dropping term store tables from s8 (T351820) [14:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:36] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [14:03:13] jouncebot: nowandnext [14:03:13] No deployments scheduled for the next 0 hour(s) and 56 minute(s) [14:03:14] In 0 hour(s) and 56 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T1500) [14:04:25] Amir1: sigh, did those have to be dropped before we can get the x3 wiki replicas in service? :( [14:04:38] taavi: I'm skipping that replica [14:04:44] ah perfect, thank you [14:05:10] Reedy: should we backport the FancyCaptcha patches? or would you rather do that tomorrow? [14:07:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json [14:17:54] (03CR) 10Volans: sre.cdn.roll-restart-ats: add cookbook for restarting ATS (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 (owner: 10Ssingh) [14:19:07] (03CR) 10Ayounsi: "I did a first pass, the logic lgtm." 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1151793 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [14:19:38] !log jmm@cumin1003 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet [14:19:50] (03CR) 10Tiziano Fogli: "LGTM, see comments in line" [alerts] - 10https://gerrit.wikimedia.org/r/1153091 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [14:19:50] 06SRE, 10SRE-Access-Requests: Update SSH key for apine - https://phabricator.wikimedia.org/T393140#10879640 (10cmassaro) 05Resolved→03Open a:05cmassaro→03None [14:20:41] 06SRE, 10SRE-Access-Requests: Update SSH key for apine - https://phabricator.wikimedia.org/T393140#10879643 (10cmassaro) Hello! @BCornwall , I am re-opening this one. I have now received the correct computer from ITS and need my key rotated. The new key is ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKzA1ewi1fQ84Inku... [14:22:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T395241)', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json [14:23:08] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance [14:23:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T395241)', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json [14:23:55] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage [14:26:19] (03CR) 10Tiziano Fogli: [C:03+1] Prometheus: gnmic_target_up rewrite name to instance [puppet] - 10https://gerrit.wikimedia.org/r/1153026 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [14:26:52] !log jmm@cumin1003 DONE (PASS) - Cookbook 
sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1063.eqiad.wmnet [14:27:40] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage [14:28:00] (03PS7) 10Cathal Mooney: New function to generate device-specific IBGP data from cluster YAML [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1151793 (https://phabricator.wikimedia.org/T394530) [14:30:10] (03CR) 10Ayounsi: [C:03+2] Prometheus: gnmic_target_up rewrite name to instance [puppet] - 10https://gerrit.wikimedia.org/r/1153026 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [14:30:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T395241)', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json [14:33:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10879725 (10Jhancock.wm) @cmooney I'm gonna reply to Jorge's email about boxes and pickup instructions. Not trying to rush, but... 
[14:33:50] (03CR) 10Tiziano Fogli: [C:03+1] gNMI: add target down alerting [alerts] - 10https://gerrit.wikimedia.org/r/1153030 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [14:35:20] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10879732 (10MoritzMuehlenhoff) [14:37:33] (03CR) 10Tiziano Fogli: [C:03+1] Add alerting for gNMIc Go routines (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1153091 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [14:43:21] (03PS14) 10Cathal Mooney: BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - 10https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) [14:45:04] (03PS15) 10Cathal Mooney: BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - 10https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) [14:45:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json [14:46:10] 10SRE-SLO, 10Abstract Wikipedia team (25Q4 (Apr–Jun)), 07OKR-Work, 07Workstreams: Establish an SLO for the Wikifunctions integration into Wikimedia projects' wikitext pages, to assure reader experience quality is maintained during roll-out - https://phabricator.wikimedia.org/T390548#10879802 (10Jdforrester-... [14:46:34] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm [14:46:44] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10879810 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host prometheus7002.magru.wmnet with OS bookworm completed: - pro... 
[14:50:19] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 [14:52:54] (03CR) 10Hashar: [C:04-1] "The mismatch of Unix username between the hosts has and is causing issue. The new host gerrit2003 should have been setup with the same use" [puppet] - 10https://gerrit.wikimedia.org/r/1152810 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [14:53:01] (03CR) 10Ssingh: "I am still confused, please forgive me. If we stop Pybal on lvs1017, lvs1020 will take over anyway? So for the duration of this event unti" [puppet] - 10https://gerrit.wikimedia.org/r/1152817 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [14:54:37] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" [14:54:41] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" [14:58:40] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" [14:58:45] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" [15:00:05] jelto, arnoldokoth, and mutante: How many deployers does it take to do SRE Collaboration Services office hours deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T1500). 
[15:00:37] (03CR) 10Ssingh: "OK thanks, I think that was my understanding as per the task and the planning on IRC: stop Pybal on lvs1017, decom right away and let lvs1" [puppet] - 10https://gerrit.wikimedia.org/r/1152817 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [15:00:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json [15:03:42] 06SRE, 10SRE-SLO, 10Observability-Metrics: Pyrra detail grafana dashboard contains two panels displaying misleading data - https://phabricator.wikimedia.org/T393797#10879863 (10elukey) [15:03:55] 06SRE, 10SRE-SLO, 10Observability-Metrics: Rework the Pyrra list dashboard - https://phabricator.wikimedia.org/T394415#10879864 (10elukey) [15:04:21] I am going to restart Gerrit [15:04:25] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#10879865 (10hashar) 05Resolved→03Open I have reverted the replica configuration since that broke GitHub replication and the new gerrit2003 host was mi... 
[15:04:29] 06SRE, 10SRE-SLO, 10Observability-Metrics: Set a predefined time window in Pyrra's configuration to measure SLOs with - https://phabricator.wikimedia.org/T393796#10879871 (10elukey) [15:04:32] cause T395887 [15:04:33] T395887: github mirror out of sync - https://phabricator.wikimedia.org/T395887 [15:04:46] 10SRE-SLO, 10Observability-Metrics, 10SRE Observability (FY2024/2025-Q3): liftwing SLO performance issues - https://phabricator.wikimedia.org/T387350#10879874 (10elukey) [15:05:04] I am waiting for some changes in CI to merge ( https://integration.wikimedia.org/zuul/ ) [15:05:21] 10SRE-SLO, 10EditCheck, 06Editing-team, 13Patch-For-Review: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#10879881 (10elukey) [15:05:30] (03CR) 10Tchanders: Assign IP auto-reveal rights to certain groups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142649 (https://phabricator.wikimedia.org/T386492) (owner: 10Tchanders) [15:06:17] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910 (10Andrew) 03NEW [15:06:18] !log Restarted Gerrit due to issue with replication config | T395887 [15:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:24] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910#10879896 (10Andrew) a:05Jclark-ctr→03Andrew [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.063s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:10:02] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye [15:10:37] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" [15:10:42] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" [15:11:05] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428#10879906 (10VirginiaPoundstone) Approved! [15:11:07] (03PS18) 10Effie Mouzeli: validating-admission-policies: add policy to permit hostPath mounts for mediawiki #0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) [15:12:17] (03PS2) 10MVernon: swift: remove ms-be2080 entirely from rings prior to reimage [puppet] - 10https://gerrit.wikimedia.org/r/1138831 (https://phabricator.wikimedia.org/T354872) [15:12:18] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428#10879912 (10VirginiaPoundstone) a:05VirginiaPoundstone→03Arnoldokoth [15:12:28] (03CR) 10Scott French: profile::kubernetes::deployment_server: add new mw-experimental release #2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148300 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [15:15:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T395241)', diff saved to 
https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json [15:15:53] (03CR) 10Jforrester: Deploy to en at twenty percent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152860 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia) [15:16:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:56] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye [15:18:03] !log installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) [15:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:37] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye [15:18:47] (03CR) 10MVernon: "Verified no weight on this node any more thus:" [puppet] - 10https://gerrit.wikimedia.org/r/1138831 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [15:19:12] (03Abandoned) 10CDobbins: varnish: Replace X-Include-PV with include_pv var [puppet] - 10https://gerrit.wikimedia.org/r/1152311 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [15:19:19] (03PS19) 10Effie Mouzeli: validating-admission-policies: add policy to permit hostPath mounts for mediawiki #0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) [15:20:23] (03CR) 10Marostegui: [C:03+1] swift: remove ms-be2080 entirely from rings prior to reimage [puppet] - 
10https://gerrit.wikimedia.org/r/1138831 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [15:22:09] (03CR) 10Scott French: "Looks like this is all done, with the exception of the depends-on change. LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [15:22:18] (03CR) 10Scott French: [C:03+1] admin_ng: add mw-experimental namespace with hostPath support #3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [15:23:40] (03PS1) 10Muehlenhoff: Record LDAP access for llugo [puppet] - 10https://gerrit.wikimedia.org/r/1153168 [15:23:41] (03CR) 10MVernon: [C:03+1] sessionstore2004: reimage as JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1153150 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [15:24:33] (03CR) 10MVernon: [C:03+1] "Looks plausible to me, but I am far from a partman expert!" [puppet] - 10https://gerrit.wikimedia.org/r/1152337 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [15:24:55] (03CR) 10MVernon: [C:03+2] swift: remove ms-be2080 entirely from rings prior to reimage [puppet] - 10https://gerrit.wikimedia.org/r/1138831 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [15:25:20] (03CR) 10Scott French: [C:03+1] "Thanks, Effie!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: 10Effie Mouzeli) [15:26:33] (03PS2) 10Muehlenhoff: Record LDAP access for llugo [puppet] - 10https://gerrit.wikimedia.org/r/1153168 [15:28:19] jouncebot: now [15:28:19] For the next 0 hour(s) and 31 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T1500) [15:28:20] (03CR) 10JMeybohm: [C:03+1] "`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) (owner: 10Effie Mouzeli) [15:28:24] jouncebot: next [15:28:25] In 0 hour(s) and 31 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T1600) [15:29:10] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for llugo [puppet] - 10https://gerrit.wikimedia.org/r/1153168 (owner: 10Muehlenhoff) [15:30:56] (03CR) 10Effie Mouzeli: [C:03+2] admin_ng: add policy for /srv/mediawiki hostPath mounts #0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151208 (https://phabricator.wikimedia.org/T395225) (owner: 10Effie Mouzeli) [15:31:39] (03PS1) 10Hnowlan: mw::maintenance: don't run purge-old-cx-drafts against test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1153171 (https://phabricator.wikimedia.org/T395892) [15:31:58] (03CR) 10Effie Mouzeli: [C:03+2] validating-admission-policies: add policy to permit hostPath mounts for mediawiki #0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) (owner: 10Effie Mouzeli) [15:32:07] (03PS1) 10Cathal Mooney: ASW Templates: modify Jinja templates step 1 [homer/public] - 10https://gerrit.wikimedia.org/r/1153172 (https://phabricator.wikimedia.org/T394530) [15:32:22] 10SRE-SLO, 10observability: Reduce Pyrra's default window from 12w to 4w - https://phabricator.wikimedia.org/T395916 (10elukey) 03NEW [15:33:04] (03Restored) 10BCornwall: varnish: 
Replace X-Include-PV with include_pv var [puppet] - 10https://gerrit.wikimedia.org/r/1152311 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [15:33:29] (03CR) 10Scott French: [C:03+1] "Thanks, Effie!" [puppet] - 10https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [15:33:46] (03CR) 10Cathal Mooney: ASW Templates: modify Jinja templates step 1 (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1153172 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [15:34:07] 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T395917 (10Anton.Kokh) 03NEW [15:34:28] (03PS2) 10Kimberly Sarabia: Deploy survey to en at twenty percent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152860 (https://phabricator.wikimedia.org/T389393) [15:34:46] 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10880081 (10Anton.Kokh) [15:35:07] (03CR) 10Effie Mouzeli: [C:03+2] profile::kubernetes::deployment_server: add usernames for mw-experimental #1 [puppet] - 10https://gerrit.wikimedia.org/r/1147782 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [15:36:30] (03CR) 10Effie Mouzeli: [C:03+2] profile::kubernetes::deployment_server: add new mw-experimental release #2 [puppet] - 10https://gerrit.wikimedia.org/r/1148300 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [15:37:47] (03PS2) 10Jdlrobson: Enable dark mode on Wikidata for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152855 (https://phabricator.wikimedia.org/T395919) [15:37:51] (03Merged) 10jenkins-bot: admin_ng: add policy for /srv/mediawiki hostPath mounts #0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151208 (https://phabricator.wikimedia.org/T395225) (owner: 10Effie Mouzeli) [15:38:18] (03PS1) 
10Andrew Bogott: octavia: configure 'octavia' service project ID [puppet] - 10https://gerrit.wikimedia.org/r/1153174 (https://phabricator.wikimedia.org/T393783) [15:38:24] (03CR) 10Jdlrobson: [C:04-1] "Chatted with Lucas about this today and it seems there a few outstanding problems with dark mode that need to be resolved before turning t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152855 (https://phabricator.wikimedia.org/T395919) (owner: 10Jdlrobson) [15:38:54] (03Merged) 10jenkins-bot: validating-admission-policies: add policy to permit hostPath mounts for mediawiki #0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) (owner: 10Effie Mouzeli) [15:39:28] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1153174 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [15:39:47] 10SRE-SLO: Add a section to the SLO template that explains Pyrra's dashboards and alerts - https://phabricator.wikimedia.org/T395920 (10elukey) 03NEW [15:40:15] (03PS2) 10Andrew Bogott: octavia: configure 'octavia' service project ID [puppet] - 10https://gerrit.wikimedia.org/r/1153174 (https://phabricator.wikimedia.org/T393783) [15:40:17] 10SRE-swift-storage, 10MinT, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10880182 (10klausman) With the above patch (and the private repo stuff) merged, we can diff on the deployment server (I elided som... 
[15:40:21] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1153174 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [15:40:56] (03CR) 10Dzahn: "Sorry, I am not going to revert everything already done on gerrit2003 or touch existing prod machine just to avoid 4 more lines in puppet " [puppet] - 10https://gerrit.wikimedia.org/r/1152810 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [15:42:46] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1152854 (https://phabricator.wikimedia.org/T395521) (owner: 10Scott French) [15:43:01] (03CR) 10Scott French: [C:03+2] deployment_server: Update the local helm cache in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1152854 (https://phabricator.wikimedia.org/T395521) (owner: 10Scott French) [15:43:09] (03CR) 10Andrew Bogott: [C:03+2] octavia: configure 'octavia' service project ID [puppet] - 10https://gerrit.wikimedia.org/r/1153174 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [15:43:44] (03CR) 10Effie Mouzeli: admin_ng: add mw-experimental namespace with hostPath support #3 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [15:43:50] (03CR) 10Effie Mouzeli: [C:03+2] admin_ng: add mw-experimental namespace with hostPath support #3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [15:43:56] (03CR) 10Dzahn: "I am aware of all this and have pending patches for fixing this, had a session with Tyler on this to keep the team informed and we are in " [puppet] - 10https://gerrit.wikimedia.org/r/1153159 (https://phabricator.wikimedia.org/T395887) (owner: 10Hashar) [15:50:51] !log jiji@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[15:51:09] jouncebot: now [15:51:09] For the next 0 hour(s) and 8 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T1500) [15:51:14] jouncebot: next [15:51:14] In 0 hour(s) and 8 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T1600) [15:51:17] (03PS1) 10Dzahn: Revert^2 "gerrit: add a second replica, start replicating to gerrit2003" [puppet] - 10https://gerrit.wikimedia.org/r/1153265 [15:51:28] !log jiji@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:51:37] !log jiji@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:52:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [15:52:15] (03PS1) 10Kevin Bazira: ml-services: update RRLA and RRML images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153266 [15:52:51] (03Merged) 10jenkins-bot: admin_ng: add mw-experimental namespace with hostPath support #3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [15:53:29] (03PS1) 10Gmodena: dse: mw-content-history: version bump staging app [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153273 (https://phabricator.wikimedia.org/T347282) [15:54:01] (03PS1) 10Cathal Mooney: IBGP_OUT policy: rename last term and also export statics [homer/public] - 10https://gerrit.wikimedia.org/r/1153274 (https://phabricator.wikimedia.org/T394530) [15:54:36] !log jiji@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[15:55:06] 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10880432 (10WMDE-leszek) [15:55:12] 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10880434 (10WMDE-leszek) I approve this request on WMDE's end [15:55:45] !log jiji@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:57:25] !log jiji@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:57:48] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:57:51] (03CR) 10BCornwall: sre.cdn.roll-restart-ats: add cookbook for restarting ATS (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 (owner: 10Ssingh) [15:57:58] 10SRE-SLO, 10observability: Reduce Pyrra's default window from 12w to 4w - https://phabricator.wikimedia.org/T395916#10880466 (10herron) For a quick simulation of viewing 4w over a longer window like 8w or 12w, we could view our current 12w window over a longer period. For instance here is a 12w (84d) pyrra w... [15:59:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T395518#10880492 (10phaultfinder) [16:00:06] jhathaway and moritzm: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T1600). [16:00:06] No Gerrit patches in the queue for this window AFAICS. [16:04:00] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:04:29] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. 
[16:04:41] (03CR) 10BCornwall: [C:03+1] conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [16:06:29] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:07:06] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki: mount mediawiki via hostPath feature #4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: 10Effie Mouzeli) [16:08:56] (03CR) 10Effie Mouzeli: [C:03+2] kubernetes:mediawiki_runner: introduce mw-experimental #5 [puppet] - 10https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [16:09:14] (03Merged) 10jenkins-bot: mediawiki: mount mediawiki via hostPath feature #4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: 10Effie Mouzeli) [16:09:58] (03Abandoned) 10BCornwall: lvs: Switch lvs1017/lvs1020 primary [puppet] - 10https://gerrit.wikimedia.org/r/1152817 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [16:11:35] (03PS1) 10Cwhite: logstash: Reroute apache.access logs to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/1153279 (https://phabricator.wikimedia.org/T390215) [16:12:00] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [16:12:34] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:12:59] (03PS1) 10Bvibber: Fixes: Charts embedded in template rendering in Parsoid [extensions/Chart] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153281 (https://phabricator.wikimedia.org/T395462) [16:13:22] (03PS1) 10Bvibber: Fixes: Charts embedded in template rendering in Parsoid [extensions/Chart] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153282 (https://phabricator.wikimedia.org/T395462) [16:17:15] Any objection to me running a quicky Chart patch backport & service deploy? 
[16:17:35] 06SRE, 10SRE-SLO, 10Observability-Metrics: Set a predefined time window in Pyrra's configuration to measure SLOs with - https://phabricator.wikimedia.org/T393796#10880585 (10herron) I'm experimenting with self referencing links in the grafana slo review/list dashboard (https://grafana-rw.wikimedia.org/d/YuUM... [16:18:04] jouncebot: next [16:18:04] In 0 hour(s) and 41 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T1700) [16:18:30] I will run scap, but it should be essentially a noop, we are doing it just to clear the diff a wee bit [16:18:46] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10880601 (10Jhancock.wm) @Andrew hey I'm running into an issue in this rack with port availability. because of the servers using 2x1G ports. how much a pa... [16:20:31] :) [16:20:33] !log jiji@deploy1003 Started scap sync-world: T276994: We merged a number of noop patches, sparing deployers the scary diffs [16:20:37] T276994: Provide an mwdebug functionality on kubernetes (mw-experimental) - https://phabricator.wikimedia.org/T276994 [16:21:45] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [16:22:02] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [16:22:11] sorry didn't see the deploy, aborting [16:22:27] (03CR) 10Btullis: [C:03+1] "Super, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1151771 (https://phabricator.wikimedia.org/T389786) (owner: 10Scott French) [16:23:03] (03PS1) 10Clément Goubert: mw-cron: Remove limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153286 [16:23:32] !log jiji@deploy1003 Finished scap sync-world: T276994: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) [16:23:35] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:23:39] 10SRE-SLO, 10observability: Reduce Pyrra's default window from 12w to 4w - https://phabricator.wikimedia.org/T395916#10880629 (10elukey) +1 makes sense! [16:25:13] (03CR) 10Ssingh: [V:03+1] "Thanks Brett for the review. I am waiting for Valentin to check as well since he had some concerns about $::site so will wait for that bef" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [16:25:56] (03CR) 10Vgutierrez: [C:03+1] conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [16:27:15] (03PS1) 10Cwhite: beta-logs: sync curator_jobs definition [puppet] - 10https://gerrit.wikimedia.org/r/1153289 [16:27:47] ok, checking exponential backoff :D [16:28:14] if no objection i'm backporting a small Charts fix to support a service update, then deploying the service update :D [16:29:01] starting on spiderpig... 
[16:29:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/Chart] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153282 (https://phabricator.wikimedia.org/T395462) (owner: 10Bvibber) [16:29:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/Chart] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153281 (https://phabricator.wikimedia.org/T395462) (owner: 10Bvibber) [16:32:28] (03Merged) 10jenkins-bot: Fixes: Charts embedded in template rendering in Parsoid [extensions/Chart] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153282 (https://phabricator.wikimedia.org/T395462) (owner: 10Bvibber) [16:32:30] (03Merged) 10jenkins-bot: Fixes: Charts embedded in template rendering in Parsoid [extensions/Chart] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153281 (https://phabricator.wikimedia.org/T395462) (owner: 10Bvibber) [16:32:56] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1153282|Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281|Fixes: Charts embedded in template rendering in Parsoid (T395462)]] [16:33:00] T395462: Charts not being output correctly in Parsoid - https://phabricator.wikimedia.org/T395462 [16:33:38] !log sukhe@dns1004 START - running authdns-update [16:33:55] !log testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run [16:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:16] !log sukhe@dns1004 END - running authdns-update [16:35:00] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1153282|Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281|Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:35:53] !log bvibber@deploy1003 bvibber: Continuing with sync [16:35:59] didn't explode! [16:36:14] (03CR) 10Cwhite: [C:03+2] beta-logs: sync curator_jobs definition [puppet] - 10https://gerrit.wikimedia.org/r/1153289 (owner: 10Cwhite) [16:36:23] (03PS2) 10Cwhite: beta-logs: sync curator_jobs definition [puppet] - 10https://gerrit.wikimedia.org/r/1153289 [16:38:24] (03CR) 10Majavah: [C:03+2] Reapply "hieradata: cloudlb: Move x3 VIP to new x3 backend" [puppet] - 10https://gerrit.wikimedia.org/r/1153146 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [16:38:54] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye [16:39:37] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153292 [16:39:59] (03PS5) 10Effie Mouzeli: mw-experimental: create new service #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) [16:40:02] (03CR) 10Effie Mouzeli: "I would like to start with the bare minimum changes here, compared to prod, and adjust accordingly based on what users/we need" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [16:40:04] (03CR) 10Cwhite: [C:03+2] beta-logs: sync curator_jobs definition [puppet] - 10https://gerrit.wikimedia.org/r/1153289 (owner: 10Cwhite) [16:40:56] (03CR) 10Effie Mouzeli: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [16:41:24] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [16:42:50] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153282|Fixes: Charts embedded in template rendering in Parsoid (T395462)]], 
[[gerrit:1153281|Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) [16:42:54] T395462: Charts not being output correctly in Parsoid - https://phabricator.wikimedia.org/T395462 [16:43:12] (03PS2) 10CDanis: otelcol: drop service-runner healthchecks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108086 (https://phabricator.wikimedia.org/T366750) [16:43:12] (03PS2) 10CDanis: otelcol: scrub echostore userids [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108106 (https://phabricator.wikimedia.org/T366750) [16:43:14] ok mw side is complete now a service update [16:44:16] (03CR) 10Xcollazo: [C:03+1] dse: mw-content-history: version bump staging app [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153273 (https://phabricator.wikimedia.org/T347282) (owner: 10Gmodena) [16:45:50] (03PS2) 10Cwhite: logstash: Reroute apache.access logs to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/1153279 (https://phabricator.wikimedia.org/T390215) [16:46:49] (03PS1) 10Hnowlan: admin_ng: bump eventrouter memory limit significantly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153300 [16:49:38] (03CR) 10CDanis: [C:03+2] otelcol: drop service-runner healthchecks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108086 (https://phabricator.wikimedia.org/T366750) (owner: 10CDanis) [16:49:57] (03CR) 10Clément Goubert: [C:03+1] admin_ng: bump eventrouter memory limit significantly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153300 (owner: 10Hnowlan) [16:50:38] agh have to merge that bit on service, that's why it wasn't showing up in pipeline build lol [16:51:36] (03PS1) 10Clément Goubert: k8s-controller-sidecars: Version bump [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1153302 [16:52:51] (03CR) 10Scott French: [C:03+1] mw-experimental: create new service #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) 
(owner: 10Effie Mouzeli) [16:53:14] (03PS1) 10Majavah: Revert "Reapply "hieradata: cloudlb: Move x3 VIP to new x3 backend"" [puppet] - 10https://gerrit.wikimedia.org/r/1153304 [16:53:19] (03Abandoned) 10Cyndywikime: Add alerting for ALIS availability [alerts] - 10https://gerrit.wikimedia.org/r/1153116 (https://phabricator.wikimedia.org/T386116) (owner: 10Cyndywikime) [16:53:47] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. [dumps/dcat] - 10https://gerrit.wikimedia.org/r/1153294 (owner: 10L10n-bot) [16:54:49] (03CR) 10Majavah: [C:03+2] Revert "Reapply "hieradata: cloudlb: Move x3 VIP to new x3 backend"" [puppet] - 10https://gerrit.wikimedia.org/r/1153304 (owner: 10Majavah) [16:55:36] (03CR) 10Hnowlan: [C:03+1] k8s-controller-sidecars: Version bump [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1153302 (owner: 10Clément Goubert) [16:55:40] (03CR) 10Scott French: mw-experimental: create new service #6 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [16:56:30] (03PS1) 10Clément Goubert: k8s-controller-sidecar: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153305 [16:56:42] (03PS6) 10Effie Mouzeli: mw-experimental: create new service #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) [16:56:46] (03PS2) 10Clément Goubert: k8s-controller-sidecar: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153305 [16:57:02] (03CR) 10Clément Goubert: [V:03+2 C:03+2] k8s-controller-sidecars: Version bump [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1153302 (owner: 10Clément Goubert) [16:57:37] !log fnegri@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries T390767 [16:57:43] T390767: Remove the compatibility layer 
of block schema in wikireplicas - https://phabricator.wikimedia.org/T390767 [16:58:47] (03PS11) 10Effie Mouzeli: hieradata: Make wikikube-worker2100 a mw-experimental worker [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) [16:59:55] !log cdanis@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [17:00:04] swfrench-wmf: #bothumor I � Unicode. All rise for MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T1700). [17:00:14] o/ [17:00:40] (03CR) 10Eevans: "This is actually tested (using @fgiunchedi@wikimedia.org 's kvm test harness)! And, I will test it further (in cassandra-dev) before it's" [puppet] - 10https://gerrit.wikimedia.org/r/1152337 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [17:00:49] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [17:01:02] (03CR) 10Effie Mouzeli: mw-experimental: create new service #6 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [17:01:07] (03PS1) 10Bvibber: Update chart-renderer to 2025-06-03-165036-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153306 (https://phabricator.wikimedia.org/T395462) [17:01:26] ok preparing to deploy chart-renderer service update [17:01:28] (03PS1) 10Dreamy Jazz: Enable temporary accounts onboarding dialog on WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153307 (https://phabricator.wikimedia.org/T395933) [17:01:53] !log cdanis@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[17:02:12] (03PS2) 10Dreamy Jazz: Enable temporary accounts onboarding dialog on WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153307 (https://phabricator.wikimedia.org/T395933) [17:02:34] (03PS3) 10Dreamy Jazz: Enable temporary accounts onboarding dialog on WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153307 (https://phabricator.wikimedia.org/T395933) [17:02:58] (03CR) 10Scott French: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1151771 (https://phabricator.wikimedia.org/T389786) (owner: 10Scott French) [17:03:52] any objections to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1153306 ? [17:04:18] (03CR) 10Dzahn: [C:03+2] zuul: create profile to setup system user and group [puppet] - 10https://gerrit.wikimedia.org/r/1152145 (owner: 10Dzahn) [17:04:31] 10SRE-SLO, 10observability: Reduce Pyrra's default window from 12w to 4w - https://phabricator.wikimedia.org/T395916#10880789 (10herron) I went ahead and made adjustments that I think simplify the fixed window view https://grafana-rw.wikimedia.org/d/ccssRIenz/slo-quarterly-drilldown {F61386259} The changes w... [17:05:18] getting started on infra window work shortly [17:05:58] bvibber: feel free to make changes to chart-renderer :) just avoid deploying mediawiki [17:06:05] ok :D [17:06:11] i'm done with my mw-side changes so feel free [17:06:26] swfrench-wmf: go forth and tweak mw land as needed :D [17:06:41] !log cdanis@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:06:57] !log cdanis@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[17:07:17] (03CR) 10Scott French: [C:03+2] Revert^2 "deployment_server: deploy the mediawiki-dumps-legacy scap target" [puppet] - 10https://gerrit.wikimedia.org/r/1151771 (https://phabricator.wikimedia.org/T389786) (owner: 10Scott French) [17:08:03] (03CR) 10Bvibber: [C:03+2] "let's hope i didn't explode it" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153306 (https://phabricator.wikimedia.org/T395462) (owner: 10Bvibber) [17:08:39] (03CR) 10Hnowlan: [C:03+1] k8s-controller-sidecar: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153305 (owner: 10Clément Goubert) [17:10:02] (03Merged) 10jenkins-bot: Update chart-renderer to 2025-06-03-165036-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153306 (https://phabricator.wikimedia.org/T395462) (owner: 10Bvibber) [17:10:03] (03CR) 10CDanis: [C:03+2] otelcol: scrub echostore userids [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108106 (https://phabricator.wikimedia.org/T366750) (owner: 10CDanis) [17:11:07] looks like we've got a couple service-related updates in there ;) [17:11:24] (03CR) 10CDanis: [V:03+2 C:03+2] otelcol: scrub echostore userids [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108106 (https://phabricator.wikimedia.org/T366750) (owner: 10CDanis) [17:11:29] (03CR) 10Hnowlan: [C:03+2] admin_ng: bump eventrouter memory limit significantly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153300 (owner: 10Hnowlan) [17:11:47] cdanis: want me to leave helm deploy to you since you're actively working in there? [17:12:08] bvibber: oh, it's a different namespace entirely 😅 you can be deploying at the same time [17:12:13] hah ok [17:12:16] i'll go ahead then [17:12:25] !log cdanis@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. 
[17:13:06] cdanis: apologies but I just +2ed an admin_ng change that might show up [17:13:14] it's safe to apply but it hasn't merged yet [17:13:17] !log bvibber@deploy1003 helmfile [staging] START helmfile.d/services/chart-renderer: apply [17:13:31] hnowlan: no worries, I'm happy to roll it into mine in eqiad [17:13:41] thanks! [17:13:54] !log bvibber@deploy1003 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [17:14:11] !log cdanis@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [17:14:43] hnowlan: if you're waiting for gate-and-submit btw i'd just V+2 and submit manually, the queue got DoS'd by the localization updates bot [17:14:54] !log bvibber@deploy1003 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [17:15:26] cdanis: ahh I was wondering why CI was taking so long earlier. There's no rush [17:15:37] ah okay, I'll just finish up then [17:15:43] !log cdanis@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:15:52] !log cdanis@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [17:16:04] !log bvibber@deploy1003 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [17:16:38] !log bvibber@deploy1003 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [17:17:09] !log bvibber@deploy1003 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [17:17:51] (03PS1) 10Dzahn: zuul (new): remove dependency on docker class [puppet] - 10https://gerrit.wikimedia.org/r/1153310 [17:19:01] alright, y'all are going to see some scap logs in a minute - this should be a noop w.r.t. 
any of the outstanding changes [17:19:46] !log swfrench@deploy1003 Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - T388761 T389786 [17:19:49] T388761: scap needs to be k8s-cluster aware - https://phabricator.wikimedia.org/T388761 [17:19:50] T389786: Integrate mediawiki-dumps-legacy with the regular MW scap deployments - https://phabricator.wikimedia.org/T389786 [17:19:52] (03Merged) 10jenkins-bot: admin_ng: bump eventrouter memory limit significantly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153300 (owner: 10Hnowlan) [17:21:21] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for Zuul upgrade project - https://phabricator.wikimedia.org/T393873#10880896 (10Dzahn) Further setup from here on will be part of T395938 to avoid amending to the VM request again and... [17:23:36] 06SRE, 06Infrastructure-Foundations: Request additional access for Dcops group - https://phabricator.wikimedia.org/T395939 (10Jclark-ctr) 03NEW [17:24:05] 06SRE, 06Infrastructure-Foundations: Request additional access for Dcops group - https://phabricator.wikimedia.org/T395939#10880927 (10Jclark-ctr) p:05Triage→03Medium [17:24:38] * swfrench-wmf shakes fist at mediawiki-dumps-legacy [17:25:42] (03PS1) 10Scott French: Revert^3 "deployment_server: deploy the mediawiki-dumps-legacy scap target" [puppet] - 10https://gerrit.wikimedia.org/r/1153316 (https://phabricator.wikimedia.org/T389786) [17:25:53] (03PS2) 10Dzahn: zuul (new): remove dependency on docker class [puppet] - 10https://gerrit.wikimedia.org/r/1153310 (https://phabricator.wikimedia.org/T395938) [17:26:38] (03CR) 10Scott French: [C:03+2] Revert^3 "deployment_server: deploy the mediawiki-dumps-legacy scap target" [puppet] - 10https://gerrit.wikimedia.org/r/1153316 (https://phabricator.wikimedia.org/T389786) (owner: 10Scott French) [17:27:29] (03CR) 10Dzahn: [C:03+2] zuul (new): remove dependency on docker class [puppet] - 
10https://gerrit.wikimedia.org/r/1153310 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:28:49] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1152423 (owner: 10Ncmonitor) [17:30:32] (03PS1) 10Bvibber: Revert "Update chart-renderer to 2025-06-03-165036-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153319 [17:32:44] (03CR) 10Bvibber: [C:03+2] "unbreak" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153319 (owner: 10Bvibber) [17:34:07] !log swfrench@deploy1003 Started scap sync-world: Scap test run after revert - T389786 [17:34:12] (03Merged) 10jenkins-bot: Revert "Update chart-renderer to 2025-06-03-165036-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153319 (owner: 10Bvibber) [17:34:12] T389786: Integrate mediawiki-dumps-legacy with the regular MW scap deployments - https://phabricator.wikimedia.org/T389786 [17:35:37] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 [17:35:57] (03CR) 10BCornwall: [V:03+2 C:03+2] "https://phabricator.wikimedia.org/T388809#10805702" [puppet] - 10https://gerrit.wikimedia.org/r/1152422 (owner: 10Ncmonitor) [17:36:01] !log swfrench@deploy1003 Finished scap sync-world: Scap test run after revert - T389786 (duration: 02m 10s) [17:36:28] (03PS1) 10Hnowlan: admin_ng: bump limits for eventrouter in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153320 [17:36:34] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: update RRLA and RRML images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153266 (owner: 10Kevin Bazira) [17:36:43] (03CR) 10Hnowlan: [C:03+1] mw-cron: Remove limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153286 (owner: 10Clément Goubert) [17:37:05] 10ops-eqiad, 06DC-Ops: Rack and cable a single mgmt switch in one of the future machine learning racks - 
https://phabricator.wikimedia.org/T395941 (10Jclark-ctr) 03NEW [17:37:18] 10ops-eqiad, 06DC-Ops: Rack and cable a single mgmt switch in one of the future machine learning racks - https://phabricator.wikimedia.org/T395941#10880985 (10Jclark-ctr) p:05Triage→03Medium [17:38:45] !log bvibber@deploy1003 helmfile [staging] START helmfile.d/services/chart-renderer: apply [17:38:57] !log bvibber@deploy1003 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [17:39:05] !log bvibber@deploy1003 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [17:39:15] !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [17:39:25] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10881004 (10Dzahn) a:05Corvus→03KFrancis [17:39:38] !log bvibber@deploy1003 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [17:39:48] !log bvibber@deploy1003 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [17:39:52] alright, now that the mediawiki-dumps-legacy mess is dealt with, I'll deploy 1153286 and be done with the infra window for today [17:40:40] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10881015 (10Dzahn) [17:41:08] (03PS1) 10Cwhite: logstash: drop thumbor unstructured logs [puppet] - 10https://gerrit.wikimedia.org/r/1153322 (https://phabricator.wikimedia.org/T368180) [17:41:10] (03CR) 10Scott French: [C:03+1] mw-cron: Remove limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153286 (owner: 10Clément Goubert) [17:41:39] (03CR) 10Scott French: [C:03+2] "Merging as discussed with @cgoubert@wikimedia.org out of band." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1153286 (owner: 10Clément Goubert) [17:41:42] !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [17:42:00] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [17:43:07] (03Merged) 10jenkins-bot: mw-cron: Remove limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153286 (owner: 10Clément Goubert) [17:43:28] (03CR) 10CI reject: [V:04-1] logstash: drop thumbor unstructured logs [puppet] - 10https://gerrit.wikimedia.org/r/1153322 (https://phabricator.wikimedia.org/T368180) (owner: 10Cwhite) [17:44:59] 10ops-eqiad, 06DC-Ops: Rack and cable a single mgmt switch in one of the future machine learning racks - https://phabricator.wikimedia.org/T395941#10881034 (10RobH) This came up as a discussion in our project sync meeting. Each rack will have a PDU and 1 server for mgmt network connections to start, plus an e... 
[17:45:37] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [17:45:58] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [17:46:27] (03CR) 10Gmodena: [C:03+2] dse: mw-content-history: version bump staging app [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153273 (https://phabricator.wikimedia.org/T347282) (owner: 10Gmodena) [17:46:31] alright, I'm done touching things :) [17:46:41] 10ops-eqiad, 06DC-Ops: Rack and cable a single mgmt switch in one of the future machine learning racks - https://phabricator.wikimedia.org/T395941#10881042 (10RobH) [17:48:08] (03Merged) 10jenkins-bot: dse: mw-content-history: version bump staging app [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153273 (https://phabricator.wikimedia.org/T347282) (owner: 10Gmodena) [17:50:46] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [17:50:49] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [17:52:27] 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work, 13Patch-For-Review: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#10881072 (10VPuffetMichel) [17:53:26] (03PS2) 10AOkoth: trafficserver: point os-reports to k8s record [puppet] - 10https://gerrit.wikimedia.org/r/1152305 (https://phabricator.wikimedia.org/T350794) [17:54:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.217s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:55:15] FIRING: 
MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.769s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:55:25] 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work, 13Patch-For-Review: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#10881099 (10VPuffetMichel) @elukey I moved this to our kanban board. David is out on vacation this week. He'll... [17:57:13] (03PS1) 10BCornwall: ncredir: Increase hash sizes [puppet] - 10https://gerrit.wikimedia.org/r/1153327 [17:59:45] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5758/co" [puppet] - 10https://gerrit.wikimedia.org/r/1153327 (owner: 10BCornwall) [18:00:05] dduvall and dancy: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T1800). [18:00:55] !log aokoth@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet [18:02:47] (03CR) 10Ssingh: ncredir: Increase hash sizes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1153327 (owner: 10BCornwall) [18:03:29] (03CR) 10BCornwall: [V:03+1] ncredir: Increase hash sizes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1153327 (owner: 10BCornwall) [18:05:04] (03CR) 10Ssingh: [C:03+1] "Looks good to me with the due diligence done but I will defer to Valentin's approval as he owns ncredir." 
[puppet] - 10https://gerrit.wikimedia.org/r/1153327 (owner: 10BCornwall) [18:08:11] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1003.eqiad.wmnet [18:10:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.764s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:11:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.344s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:12:42] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1137361 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [18:13:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152860 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia) [18:13:33] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:16:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.268s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:16:33] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:18:04] (03PS1) 10Cwhite: logstash: drop high volume of tegola-vector-tiles logs [puppet] - 10https://gerrit.wikimedia.org/r/1153329 (https://phabricator.wikimedia.org/T387261) [18:19:12] (03CR) 10Dzahn: [C:03+2] gerrit: replace gerrit2003 RSA host key with ed25519 host key [puppet] - 10https://gerrit.wikimedia.org/r/1152819 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [18:19:18] (03PS3) 10Dzahn: gerrit: replace gerrit2003 RSA host key with ed25519 host key [puppet] - 10https://gerrit.wikimedia.org/r/1152819 (https://phabricator.wikimedia.org/T372804) [18:21:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10881244 (10BCornwall) [18:21:29] (03CR) 10Dzahn: [C:03+2] gerrit: replace gerrit2003 RSA host key with ed25519 host key [puppet] - 10https://gerrit.wikimedia.org/r/1152819 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [18:21:52] (03CR) 10Dzahn: [C:03+2] "this is not touching anything regarding existing replication" [puppet] - 10https://gerrit.wikimedia.org/r/1152819 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [18:22:43] brett: i just did both [18:24:04] mutante: Er, both of what? 
[18:24:22] puppet changes that were merged but not puppet-merged [18:24:30] pywikipedia in ncredir [18:24:42] oh, sonofabitch, sorry I forgot [18:24:53] Thanks :) [18:25:06] no worries, it looked harmless enough to me and I had seen that patch before [18:25:20] ha, yeah, that saga is hopefully done with now [18:25:21] like either we have it in DNS now or we dont :) [18:25:31] alright, sounds good [18:27:39] (03CR) 10Dzahn: [C:03+2] aptrepo: add thirdparty/ci component to bookworm-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1137361 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [18:28:19] (03CR) 10Dzahn: [C:03+2] "I should have changed the commit message title.. noticed 10 seconds too late." [puppet] - 10https://gerrit.wikimedia.org/r/1137361 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [18:39:07] (03PS1) 10BCornwall: Rotate SSH key for cmassaro [puppet] - 10https://gerrit.wikimedia.org/r/1153331 (https://phabricator.wikimedia.org/T393140) [18:40:35] (03CR) 10Scott French: hieradata: Make wikikube-worker2100 a mw-experimental worker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [18:44:37] (03CR) 10Scott French: [C:03+1] mw-experimental: create new service #6 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [18:47:21] !log ebernhardson@deploy1003 Started deploy [wdqs/wdqs@fea7794]: 0.3.157 [18:49:09] (03PS2) 10Cwhite: logstash: drop thumbor unstructured logs [puppet] - 10https://gerrit.wikimedia.org/r/1153322 (https://phabricator.wikimedia.org/T368180) [18:53:59] (03PS1) 10CDobbins: add rest of south amer (except Falkland Islands) to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/1153334 [19:00:06] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) [19:01:02] (03CR) 10Dzahn: "due to the revert this is not amendable before a re-revert. But what I wanted to suggest as a middle ground is.. let's forget about introd" [puppet] - 10https://gerrit.wikimedia.org/r/1152810 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [19:02:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:05:18] !log ebernhardson@deploy1003 Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) [19:05:19] (03Abandoned) 10Dzahn: gerrit: introduce second daemon_user name [puppet] - 10https://gerrit.wikimedia.org/r/1152810 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [19:07:44] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) [19:08:39] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) [19:09:04] (manually cancelled, decided i should use a different stat host from the parallel reload) [19:10:00] ryankemper interesting, I didn't realize we could do data reloads from HDFS [19:11:22] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) [19:12:21] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153336 (https://phabricator.wikimedia.org/T392174) [19:12:22] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153336 (https://phabricator.wikimedia.org/T392174) (owner: 10TrainBranchBot) [19:13:09] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153336 (https://phabricator.wikimedia.org/T392174) (owner: 10TrainBranchBot) [19:13:33] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:13:37] (03PS1) 10Ryan Kemper: sre.wdqs.data-reload: flesh out commands to check stat host space [cookbooks] - 10https://gerrit.wikimedia.org/r/1153337 [19:15:05] (03CR) 10Bking: [C:03+1] sre.wdqs.data-reload: flesh out commands to check stat host space [cookbooks] - 10https://gerrit.wikimedia.org/r/1153337 (owner: 10Ryan Kemper) [19:16:33] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:22:29] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10881391 (10Jgreen) >>! In T394788#10876055, @Dzahn wrote: > This seems like a continuation of T330944 from 2023. Yeah, pretty much. 
[19:22:39] 10ops-codfw, 06SRE-OnFire, 10Cassandra, 06DC-Ops, and 2 others: additional sessionstore expansion — codfw - https://phabricator.wikimedia.org/T395954 (10Eevans) 03NEW [19:22:55] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs T392174 [19:22:58] T392174: 1.45.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T392174 [19:23:39] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): additional sessionstore expansion — eqiad - https://phabricator.wikimedia.org/T395955 (10Eevans) 03NEW [19:23:59] 10ops-eqiad, 06SRE-OnFire, 10Cassandra, 06DC-Ops, and 3 others: additional sessionstore expansion — eqiad - https://phabricator.wikimedia.org/T395955#10881426 (10Eevans) [19:24:50] 10ops-eqiad, 06SRE-OnFire, 10Cassandra, 06DC-Ops, and 3 others: additional sessionstore expansion — eqiad - https://phabricator.wikimedia.org/T395955#10881442 (10Eevans) [19:24:58] 10ops-codfw, 06SRE-OnFire, 10Cassandra, 06DC-Ops, and 2 others: additional sessionstore expansion — codfw - https://phabricator.wikimedia.org/T395954#10881443 (10Eevans) [19:35:22] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) [19:35:23] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) [19:36:05] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) [19:36:10] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) [19:36:32] (03CR) 10Ssingh: "Which cluster would these be a part of? Because an equivalent entry needs to be created in conftool-data/node/$site.yaml, unless I am mist" [puppet] - 10https://gerrit.wikimedia.org/r/1151308 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [19:37:44] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) [19:37:49] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) [19:40:17] (03CR) 10Ssingh: [C:03+1] "(Let's pug bug # number.)" [dns] - 10https://gerrit.wikimedia.org/r/1151304 (owner: 10Ebernhardson) [19:40:49] (03CR) 10Ssingh: "I am not very sure about this. My recommendation would be to simply create per-service DNS discovery records instead of this CNAME. Though" [dns] - 10https://gerrit.wikimedia.org/r/1151303 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:46:50] (03PS1) 10C. Scott Ananian: Use ::getContentId() and ::clearContentId() from the Parsoid extension API [extensions/Cite] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153341 [19:47:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/Cite] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153341 (owner: 10C. 
Scott Ananian) [19:48:08] (03PS5) 10Ryan Kemper: search: Add search-{psi,omega} geoip discovery entries [dns] - 10https://gerrit.wikimedia.org/r/1151304 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:48:32] (03CR) 10Ryan Kemper: "Bug added. patch still needs a rebase onto production branch but will hold off on that for the timebeing" [dns] - 10https://gerrit.wikimedia.org/r/1151304 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:51:37] (03CR) 10Ryan Kemper: "There's not an entry for `search` in conftool-data/node/$site.yaml, so why would we need one for these 2 new entries?" [puppet] - 10https://gerrit.wikimedia.org/r/1151308 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [19:51:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10881532 (10Andrew) 05Resolved→03Open a:05Jclark-ctr→03cmooney The cookbook is rejecting cloudcephosd1048 for failing network tests, so I assume th... [19:52:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [19:58:30] (03CR) 10Ryan Kemper: [C:03+2] sre.wdqs.data-reload: flesh out commands to check stat host space [cookbooks] - 10https://gerrit.wikimedia.org/r/1153337 (owner: 10Ryan Kemper) [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T2000). [20:00:04] kimberly_sarabia and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:11] hello [20:00:48] Subbu is here for me for the next ~15min but I should be fully online by the time the first patch in the window is deployed [20:02:48] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10881582 (10Jgreen) [20:02:58] (03CR) 10Dzahn: [C:03+2] "[apt1002:~] $ sudo -i reprepro -C thirdparty/jenkins list bookworm-wikimedia jenkins" [puppet] - 10https://gerrit.wikimedia.org/r/1137361 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [20:04:33] hi - i can deploy [20:04:46] cjming: ty [20:04:56] (03PS3) 10Kimberly Sarabia: Deploy survey to en at twenty percent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152860 (https://phabricator.wikimedia.org/T389393) [20:05:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152860 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia) [20:06:05] cscott: ack [20:06:07] (03Merged) 10jenkins-bot: sre.wdqs.data-reload: flesh out commands to check stat host space [cookbooks] - 10https://gerrit.wikimedia.org/r/1153337 (owner: 10Ryan Kemper) [20:06:24] (03Merged) 10jenkins-bot: Deploy survey to en at twenty percent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152860 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia) [20:06:49] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1152860|Deploy survey to en at twenty percent (T389393)]] [20:06:53] T389393: Summaries: Create QuickSurvey for community prototype - https://phabricator.wikimedia.org/T389393 [20:08:14] o/ [20:08:57] !log cjming@deploy1003 ksarabia, cjming: Backport for [[gerrit:1152860|Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:09:13] sarabia: ^^ if you'd like to test [20:09:31] cjming: ok one moment [20:09:34] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) [20:10:30] (03CR) 10Bking: "That seems reasonable to me. Let me check with my teammates and I'll update the patch if everyone is OK w/it." [dns] - 10https://gerrit.wikimedia.org/r/1151303 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [20:10:57] cjming: LGTM [20:11:02] cool [20:11:08] !log cjming@deploy1003 ksarabia, cjming: Continuing with sync [20:12:35] ok, i'm back. and i can spiderpig deploy my patch when it comes time [20:12:55] great! 1st patch is syncing so any minute now [20:14:48] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10881598 (10Jgreen) [20:16:53] 10ops-eqiad, 06SRE, 06DC-Ops: Rack and cable a single mgmt switch in one of the future machine learning racks - https://phabricator.wikimedia.org/T395941#10881600 (10Jclark-ctr) [20:18:07] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152860|Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) [20:18:13] T389393: Summaries: Create QuickSurvey for community prototype - https://phabricator.wikimedia.org/T389393 [20:18:15] cscott: all yours [20:18:26] sarabia: should be live :) [20:19:33] ok! [20:19:50] cjming: thanks! [20:19:57] yw! 
[20:20:57] (03CR) 10Bking: "Upon further review, the DNS discovery records are set in a subsequent patch ( https://gerrit.wikimedia.org/r/c/operations/dns/+/1151304/5" [dns] - 10https://gerrit.wikimedia.org/r/1151303 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [20:22:07] and for the record, this backport has dependencies, but both of the dependencies are already in wmf.4 already. it's a good warning from spiderpig, although "found dependencies are neither configuration [20:22:07] changes nor do they belong to the same branch. " isn't exactly accurate -- they belong to the branch, they are just before the branch point. [20:22:26] i'll give spiderpig a pass here though because parsoid dependencies go through mediawiki-vendor in what can be a confusing way [20:22:59] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10881658 (10wiki_willy) Hey @Marostegui - we currently have limited availability on 10g switches, until the 10g switch refresh is completed (likely in Q1). Can these go on 1g switc... [20:23:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/Cite] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153341 (owner: 10C. Scott Ananian) [20:23:35] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:24:43] (03Merged) 10jenkins-bot: Use ::getContentId() and ::clearContentId() from the Parsoid extension API [extensions/Cite] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153341 (owner: 10C. 
Scott Ananian) [20:25:07] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1153341|Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] [20:25:55] (03CR) 10Scott French: "Thanks, Amir! This seems reasonable to me. As you note, this isn't the cleanest solution, but it certainly seems like the simplest." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152853 (https://phabricator.wikimedia.org/T395696) (owner: 10Ladsgroup) [20:27:10] !log cscott@deploy1003 cscott: Backport for [[gerrit:1153341|Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:28:30] ok, testing now on group0 [20:30:48] ok, it looks good. I'm going to proceed. [20:30:55] !log cscott@deploy1003 cscott: Continuing with sync [20:31:09] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) [20:33:11] hi folks, can i add a late addition to the backport window? i want to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1153339 [20:33:32] (03PS1) 10Muehlenhoff: Also configure new update config for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1153349 (https://phabricator.wikimedia.org/T392127) [20:34:06] (03CR) 10Dzahn: "hah, thanks! 
I was in the middle of trying to run the initial update for the new component" [puppet] - 10https://gerrit.wikimedia.org/r/1153349 (https://phabricator.wikimedia.org/T392127) (owner: 10Muehlenhoff) [20:34:27] (03PS1) 10Bartosz Dziewoński: Use default preference if no client preference in auth request [extensions/CentralAuth] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153350 (https://phabricator.wikimedia.org/T395957) [20:34:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/CentralAuth] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153350 (https://phabricator.wikimedia.org/T395957) (owner: 10Bartosz Dziewoński) [20:34:46] (03PS1) 10SBassett: Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153351 [20:35:51] (03CR) 10Dzahn: [C:03+2] Also configure new update config for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1153349 (https://phabricator.wikimedia.org/T392127) (owner: 10Muehlenhoff) [20:36:16] MatmaRex: fine by me -- do you want to self-deploy or do you need a deployer? [20:37:02] i can't, i'd appreciate if someone could deploy [20:37:30] (03PS2) 10SBassett: Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153351 (https://phabricator.wikimedia.org/T150898) [20:37:45] np - i'll do it after cscott is done [20:37:48] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153341|Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) [20:38:12] i'm done [20:38:18] perfect timing [20:38:19] looks good to me [20:38:32] yay! 
[20:40:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153350 (https://phabricator.wikimedia.org/T395957) (owner: 10Bartosz Dziewoński) [20:40:40] (03PS2) 10CDobbins: add rest of south amer (except Falkland Islands) to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/1153334 [20:42:01] MatmaRex: if you want to though you're a permission request away: https://idm.wikimedia.org/permissions/ :) [20:42:32] c.f., https://wikitech.wikimedia.org/wiki/Scap/SpiderPig [20:42:51] does it no longer require me to ssh into production for a one-time token thing? [20:43:16] it does, but we can get you what you need for it [20:44:13] well. i'm just a guinea pig making sure that people without ssh access can still get deployments done :) [20:46:05] heh, well. played. It's not too bad of a process if you change your mind. I've been filing access requests for folks who request spiderpig access. We'd like to remove it at some point, hopefully to be replaced with future idm 2fa. 
[20:46:56] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning [20:46:56] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet [20:47:22] the "it" that we'd like to remove in my confusing sentence above is the ssh 2fa [20:51:58] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966 (10thcipriani) 03NEW [20:52:49] (03Merged) 10jenkins-bot: Use default preference if no client preference in auth request [extensions/CentralAuth] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153350 (https://phabricator.wikimedia.org/T395957) (owner: 10Bartosz Dziewoński) [20:53:13] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1153350|Use default preference if no client preference in auth request (T395957)]] [20:53:16] T395957: PHP Warning: Undefined array key "clientPref" - https://phabricator.wikimedia.org/T395957 [20:53:56] (03PS1) 10Cwhite: logstash: rename cp0000 hosts to cp1000 [puppet] - 10https://gerrit.wikimedia.org/r/1153359 [20:54:03] FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH events) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:55:19] !log cjming@deploy1003 matmarex, cjming: Backport for [[gerrit:1153350|Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:55:32] MatmaRex: ^^ if testable [20:55:40] cjming: not really [20:55:48] i will sync then [20:55:55] thanks. 
i plan to look at the logs to verify later [20:56:05] !log cjming@deploy1003 matmarex, cjming: Continuing with sync [20:59:03] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH events) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:59:27] (03PS2) 10Cwhite: logstash: rename cp0000.*.wmnet hosts to cp1000 [puppet] - 10https://gerrit.wikimedia.org/r/1153359 [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250603T2100) [21:00:05] (03PS1) 10SBassett: Add Speedios999 to security.wikimedia.org hall of fame [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153360 (https://phabricator.wikimedia.org/T395195) [21:01:13] (03PS3) 10Cwhite: logstash: rename cp0000.*.wmnet hosts to cp1000 in tests [puppet] - 10https://gerrit.wikimedia.org/r/1153359 [21:02:56] (03CR) 10SBassett: [C:04-1] "Waiting on published image to show up at https://docker-registry.wikimedia.org/repos/sre/miscweb/security-landing-page/tags/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153360 (https://phabricator.wikimedia.org/T395195) (owner: 10SBassett) [21:03:02] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153350|Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) [21:03:04] T395957: PHP Warning: Undefined array key "clientPref" - https://phabricator.wikimedia.org/T395957 [21:03:12] thanks cjming [21:03:23] yw! [21:04:55] I'm about to do a security related config deploy. are the current deploys finished? [21:05:08] we're done! all yours maryum [21:05:16] awesome, thanks! 
[21:05:54] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10881855 (10Andrew) I think I don't care what port it's in -- that's the same rack, right? [21:07:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mstyles@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153351 (https://phabricator.wikimedia.org/T150898) (owner: 10SBassett) [21:09:16] (03Merged) 10jenkins-bot: Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153351 (https://phabricator.wikimedia.org/T150898) (owner: 10SBassett) [21:09:27] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10881864 (10Jhancock.wm) yeah same rack. I just need someone to migrate it for me since cloud is more complex than the netbox script can handle. and possibl... [21:09:39] !log mstyles@deploy1003 Started scap sync-world: Backport for [[gerrit:1153351|Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] [21:09:42] T150898: Force OATHAuth (2FA) for certain user groups in Wikimedia production - https://phabricator.wikimedia.org/T150898 [21:11:43] !log mstyles@deploy1003 mstyles, sbassett: Backport for [[gerrit:1153351|Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:12:11] (CR) Cwhite: [C:+2] logstash: rename cp0000.*.wmnet hosts to cp1000 in tests [puppet] - https://gerrit.wikimedia.org/r/1153359 (owner: Cwhite) [21:12:27] (PS13) Dreamy Jazz: Enable electionclerk user group on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1083870 (https://phabricator.wikimedia.org/T378287) (owner: Dreamrimmer) [21:12:48] (PS3) Cwhite: logstash: add early-stage filter to populate event.original [puppet] - https://gerrit.wikimedia.org/r/1152850 (https://phabricator.wikimedia.org/T234565) [21:13:50] (PS1) Gergő Tisza: logging: Allow sampling of Logstash logs [mediawiki-config] - https://gerrit.wikimedia.org/r/1153363 (https://phabricator.wikimedia.org/T395967) [21:13:51] (PS1) Gergő Tisza: logging: Sample some high-volume log streams [mediawiki-config] - https://gerrit.wikimedia.org/r/1153364 (https://phabricator.wikimedia.org/T394402) [21:14:20] !log mstyles@deploy1003 mstyles, sbassett: Continuing with sync [21:14:44] (CR) CI reject: [V:-1] logging: Allow sampling of Logstash logs [mediawiki-config] - https://gerrit.wikimedia.org/r/1153363 (https://phabricator.wikimedia.org/T395967) (owner: Gergő Tisza) [21:14:46] (CR) CI reject: [V:-1] logging: Sample some high-volume log streams [mediawiki-config] - https://gerrit.wikimedia.org/r/1153364 (https://phabricator.wikimedia.org/T394402) (owner: Gergő Tisza) [21:17:09] (CR) SBassett: Add Speedios999 to security.wikimedia.org hall of fame [deployment-charts] - https://gerrit.wikimedia.org/r/1153360 (https://phabricator.wikimedia.org/T395195) (owner: SBassett) [21:18:09] !log bvibber@deploy1003 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [21:18:29] that should've finished a while ago, sorry :D [21:21:11] !log mstyles@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153351|Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) [21:21:14] T150898: 
Force OATHAuth (2FA) for certain user groups in Wikimedia production - https://phabricator.wikimedia.org/T150898 [21:24:12] (CR) Mstyles: [C:+2] "verified that the tag is now present in the docker registry" [deployment-charts] - https://gerrit.wikimedia.org/r/1153360 (https://phabricator.wikimedia.org/T395195) (owner: SBassett) [21:24:51] (CR) Dreamy Jazz: [C:+1] "Looks good to merge once I24ebb8e8435bc30ac528a107b0be5d430c01e3a6 is deployed to all wikis next week." [mediawiki-config] - https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) (owner: D3r1ck01) [21:26:02] (Merged) jenkins-bot: Add Speedios999 to security.wikimedia.org hall of fame [deployment-charts] - https://gerrit.wikimedia.org/r/1153360 (https://phabricator.wikimedia.org/T395195) (owner: SBassett) [21:28:00] !log sbassett@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [21:38:06] !log sbassett@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [21:38:27] !log sbassett@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [21:41:10] !log removing 2 files for legal compliance [21:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:21] ops-eqiad, SRE, DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971 (Jclark-ctr) NEW [21:48:39] !log sbassett@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [21:53:41] !log removing 4 files for legal compliance [21:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:57] (CR) Jforrester: Deploy survey to en at twenty percent (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1152860 (https://phabricator.wikimedia.org/T389393) (owner: Kimberly Sarabia) [21:59:00] !log sbassett@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [21:59:53] (CR) Jforrester: [C:+1] "Thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/1153331 (https://phabricator.wikimedia.org/T393140) (owner: BCornwall) [22:02:23] (CR) Ladsgroup: etcd: Remove ES clusters from "write clusters" if section is RO (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1152853 (https://phabricator.wikimedia.org/T395696) (owner: Ladsgroup) [22:05:49] (CR) Ladsgroup: "That would work for me" [puppet] - https://gerrit.wikimedia.org/r/1152673 (owner: Marostegui) [22:09:22] !log sbassett@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [22:14:03] ops-eqiad, SRE, DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#10882050 (RobH) Ok, I had to chat with John a bit in IRC to have a full understanding of the scope of this, and how it could have occurred. Short Answer: https://www.servertech.com/support someone in eqiad shoul... [22:23:18] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10882069 (VRiley-WMF) Hey @Dwisehaupt thanks for letting us know. I have reseated the cables and confirmed they have a firm connection on all the 10... 
[22:39:35] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [22:42:10] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:52:54] (PS1) Jforrester: Bump portals to the 2025-06-02 09:23:11+00:00 build [mediawiki-config] - https://gerrit.wikimedia.org/r/1153385 (https://phabricator.wikimedia.org/T128546) [22:54:36] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1153385 (https://phabricator.wikimedia.org/T128546) (owner: Jforrester) [22:54:54] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1151781 (owner: Jforrester) [22:55:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1151751 (https://phabricator.wikimedia.org/T383079) (owner: Jforrester) [23:02:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:14:50] (CR) Scott French: [C:+1] "Thanks, Amir!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1152853 (https://phabricator.wikimedia.org/T395696) (owner: Ladsgroup) [23:34:02] (CR) Scott French: [C:+1] "And immediately after pressing send I recalled that conftool will require a change as well. I'll follow up on T395696." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1152853 (https://phabricator.wikimedia.org/T395696) (owner: Ladsgroup) [23:38:26] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1153389 [23:38:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1153389 (owner: TrainBranchBot) [23:41:14] (CR) Jforrester: "check experimental" [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1148868 (https://phabricator.wikimedia.org/T390666) (owner: Hashar) [23:49:58] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1153389 (owner: TrainBranchBot) [23:52:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity