[00:08:11] PROBLEM - Disk space on an-worker1109 is CRITICAL: DISK CRITICAL - free space: / 2071 MB (3% inode=95%): /tmp 2071 MB (3% inode=95%): /var/tmp 2071 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1109&var-datasource=eqiad+prometheus/ops
[00:08:19] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1153404
[00:08:19] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1153404 (owner: TrainBranchBot)
[00:10:57] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 650.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:19:25] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:23:35] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[00:24:33] PROBLEM - Check unit status of sync-puppet-volatile on puppetmaster2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:29:38] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1153404 (owner: TrainBranchBot)
[00:34:25] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:34:33] RECOVERY - Check unit status of sync-puppet-volatile on puppetmaster2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:54:15] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/850ae973137843de745a742424486f7b19af2b5aac7f35bcb81766e667b2dfb7/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:04:32] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10882367 (Dwisehaupt) @VRiley-WMF Thanks. I have rebuilt both of the boxes and things are looking better. pay-lb1001 shows both interfaces up and bo...
[01:14:15] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:43:33] (CR) BCornwall: [C:+2] Rotate SSH key for cmassaro [puppet] - https://gerrit.wikimedia.org/r/1153331 (https://phabricator.wikimedia.org/T393140) (owner: BCornwall)
[01:44:55] SRE, SRE-Access-Requests, Patch-For-Review: Update SSH key for apine - https://phabricator.wikimedia.org/T393140#10882384 (BCornwall) Open→Resolved a:BCornwall This has been merged and will be in effect shortly. Thanks!
[01:48:11] PROBLEM - Disk space on an-worker1109 is CRITICAL: DISK CRITICAL - free space: / 2041 MB (3% inode=95%): /tmp 2041 MB (3% inode=95%): /var/tmp 2041 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1109&var-datasource=eqiad+prometheus/ops
[02:01:51] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10882396 (BCornwall)
[02:02:46] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10882398 (BCornwall) a:VRiley-WMF→BCornwall
[02:03:18] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10882400 (BCornwall)
[02:12:01] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10882406 (BCornwall)
[02:13:16] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10882408 (BCornwall)
[02:14:40] (PS1) BCornwall: hiera: Replace lvs1017 with lvs1016 [puppet] - https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145)
[02:15:33] (PS2) BCornwall: hiera: Replace lvs1017 with lvs1016 [puppet] - https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145)
[02:16:18] (CR) BCornwall: [C:-2] "Once lvs1016 is reimaged then we can visit this." [puppet] - https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145) (owner: BCornwall)
[02:17:57] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:35:31] PROBLEM - snapshot of s8 in codfw on backupmon1001 is CRITICAL: Last snapshot for s8 at codfw (db2198) taken on 2025-06-04 01:38:26 is 1088 GiB, but the previous one was 1351 GiB, a change of -19.5 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:02:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:09:45] PROBLEM - snapshot of s8 in eqiad on backupmon1001 is CRITICAL: Last snapshot for s8 at eqiad (db1171) taken on 2025-06-04 01:52:23 is 1216 GiB, but the previous one was 1587 GiB, a change of -23.3 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:35:25] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:35:33] PROBLEM - Check unit status of sync-puppet-volatile on puppetmaster2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:45:33] RECOVERY - Check unit status of sync-puppet-volatile on puppetmaster2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:49:36] (CR) Bartosz Dziewoński: [C:-1] "+1, except there's a test case for the current behavior." [mediawiki-config] - https://gerrit.wikimedia.org/r/1153363 (https://phabricator.wikimedia.org/T395967) (owner: Gergő Tisza)
[03:50:25] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:51:36] (CR) Bartosz Dziewoński: [C:-1] "We definitely should do this, I was about to suggest it too. Thanks for tracking down that the throttling exists." [mediawiki-config] - https://gerrit.wikimedia.org/r/1153364 (https://phabricator.wikimedia.org/T394402) (owner: Gergő Tisza)
[03:52:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[04:23:35] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[04:28:11] PROBLEM - Disk space on an-worker1109 is CRITICAL: DISK CRITICAL - free space: / 2123 MB (3% inode=95%): /tmp 2123 MB (3% inode=95%): /var/tmp 2123 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1109&var-datasource=eqiad+prometheus/ops
[04:53:51] PROBLEM - Disk space on an-worker1117 is CRITICAL: DISK CRITICAL - free space: / 1931 MB (3% inode=94%): /tmp 1931 MB (3% inode=94%): /var/tmp 1931 MB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1117&var-datasource=eqiad+prometheus/ops
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:27] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10882518 (Marostegui) >>! In T393107#10881658, @wiki_willy wrote: > Hey @Marostegui - we currently have limited availability on 10g switches, until the 10g switch refresh is compl...
[05:31:59] (CR) Giuseppe Lavagetto: "My take on this is that we should still keep downloading metadata on creation instead of sorting, see my suggestion inline which amounts t" [docker-images/docker-report] - https://gerrit.wikimedia.org/r/1078345 (owner: Elukey)
[05:33:51] PROBLEM - Disk space on an-worker1117 is CRITICAL: DISK CRITICAL - free space: / 2125 MB (3% inode=94%): /tmp 2125 MB (3% inode=94%): /var/tmp 2125 MB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1117&var-datasource=eqiad+prometheus/ops
[05:40:21] (PS1) Gerrit maintenance bot: mariadb: Promote es1039 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1153429 (https://phabricator.wikimedia.org/T395982)
[05:40:26] (PS1) Gerrit maintenance bot: wmnet: Update es7-master alias [dns] - https://gerrit.wikimedia.org/r/1153430 (https://phabricator.wikimedia.org/T395982)
[05:41:01] (PS1) Marostegui: db-production.php: Disable writes on es7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1153431 (https://phabricator.wikimedia.org/T395982)
[05:42:05] (CR) TrainBranchBot: [C:+2] "Approved by marostegui@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1153431 (https://phabricator.wikimedia.org/T395982) (owner: Marostegui)
[05:42:46] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 T395982
[05:42:49] T395982: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T395982
[05:42:54] (Merged) jenkins-bot: db-production.php: Disable writes on es7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1153431 (https://phabricator.wikimedia.org/T395982) (owner: Marostegui)
[05:43:41] !log marostegui@deploy1003 Started scap sync-world: Backport for [[gerrit:1153431|db-production.php: Disable writes on es7 (T395982)]]
[05:45:52] !log marostegui@deploy1003 marostegui: Backport for [[gerrit:1153431|db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[05:49:44] !log marostegui@deploy1003 marostegui: Continuing with sync
[05:53:49] (CR) Giuseppe Lavagetto: cache::haproxy: fully set x-provenance (5 comments) [puppet] - https://gerrit.wikimedia.org/r/1152887 (https://phabricator.wikimedia.org/T392217) (owner: Giuseppe Lavagetto)
[05:55:28] (PS3) Giuseppe Lavagetto: cache::haproxy: fully set x-provenance [puppet] - https://gerrit.wikimedia.org/r/1152887 (https://phabricator.wikimedia.org/T392217)
[05:55:28] (PS3) Giuseppe Lavagetto: haproxy: remove conditionals on wikimedia_trust [puppet] - https://gerrit.wikimedia.org/r/1152894
[05:55:36] (CR) Giuseppe Lavagetto: haproxy: remove conditionals on wikimedia_trust (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1152894 (owner: Giuseppe Lavagetto)
[05:55:49] (CR) CI reject: [V:-1] cache::haproxy: fully set x-provenance [puppet] - https://gerrit.wikimedia.org/r/1152887 (https://phabricator.wikimedia.org/T392217) (owner: Giuseppe Lavagetto)
[05:56:03] (CR) CI reject: [V:-1] haproxy: remove conditionals on wikimedia_trust [puppet] - https://gerrit.wikimedia.org/r/1152894 (owner: Giuseppe Lavagetto)
[05:56:41] !log marostegui@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153431|db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
[05:56:44] T395982: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T395982
[05:57:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es1039 with weight 0 T395982', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
[05:57:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:58:01] (CR) Marostegui: [C:+2] mariadb: Promote es1039 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1153429 (https://phabricator.wikimedia.org/T395982) (owner: Gerrit maintenance bot)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T0600)
[06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:02:21] !log Starting es7 eqiad failover from es1035 to es1039 - T395982
[06:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:24] T395982: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T395982
[06:02:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1039 to es7 primary T395982', diff saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
[06:02:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:03:07] (CR) Marostegui: [C:+2] wmnet: Update es7-master alias [dns] - https://gerrit.wikimedia.org/r/1153430 (https://phabricator.wikimedia.org/T395982) (owner: Gerrit maintenance bot)
[06:03:11] !log marostegui@dns1006 START - running authdns-update
[06:03:52] !log marostegui@dns1006 END - running authdns-update
[06:04:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1035 T395982', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json
[06:04:49] (PS1) Marostegui: Revert "db-production.php: Disable writes on es7" [mediawiki-config] - https://gerrit.wikimedia.org/r/1153432
[06:06:01] (PS1) Marostegui: es1035: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1153433 (https://phabricator.wikimedia.org/T395647)
[06:06:30] (CR) Marostegui: [C:+2] es1035: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1153433 (https://phabricator.wikimedia.org/T395647) (owner: Marostegui)
[06:13:35] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[06:17:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:19:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:20:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json
[06:20:29] RECOVERY - MariaDB Replica Lag: s8 on clouddb1020 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:20:46] (CR) TrainBranchBot: [C:+2] "Approved by marostegui@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1153432 (owner: Marostegui)
[06:21:32] (Merged) jenkins-bot: Revert "db-production.php: Disable writes on es7" [mediawiki-config] - https://gerrit.wikimedia.org/r/1153432 (owner: Marostegui)
[06:21:36] ops-eqiad, SRE, DC-Ops: Rack and cable a single mgmt switch in one of the future machine learning racks - https://phabricator.wikimedia.org/T395941#10882582 (ayounsi) That makes sense to me, but I'd prefer we purchase a new (unmanaged) switch and not re-use a decommissioned one.
[06:21:57] !log marostegui@deploy1003 Started scap sync-world: Backport for [[gerrit:1153432|Revert "db-production.php: Disable writes on es7"]]
[06:22:51] (PS1) Marostegui: es2048: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1153434 (https://phabricator.wikimedia.org/T395771)
[06:23:38] (CR) Marostegui: [C:+2] es2048: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1153434 (https://phabricator.wikimedia.org/T395771) (owner: Marostegui)
[06:23:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json
[06:24:06] !log marostegui@deploy1003 marostegui: Backport for [[gerrit:1153432|Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[06:24:51] !log marostegui@deploy1003 marostegui: Continuing with sync
[06:27:18] (CR) Marostegui: [C:-2] "Actually that's not as easy as parsercache has: datadir => '/srv/sqldata-cache'," [puppet] - https://gerrit.wikimedia.org/r/1152673 (owner: Marostegui)
[06:27:42] (CR) Marostegui: [C:-2] "We should really fix this snowflake too." [puppet] - https://gerrit.wikimedia.org/r/1152673 (owner: Marostegui)
[06:30:03] (CR) Marostegui: [C:-2] "https://phabricator.wikimedia.org/T395983" [puppet] - https://gerrit.wikimedia.org/r/1152673 (owner: Marostegui)
[06:30:19] (CR) Ayounsi: [C:+1] "I find `export` less clear than the previous name, but that lgtm. Maybe a comment there too could help in the future ? No strong feeling." [homer/public] - https://gerrit.wikimedia.org/r/1153274 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[06:31:49] !log marostegui@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153432|Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s)
[06:33:42] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[06:33:45] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[06:34:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:35:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json
[06:39:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json
[06:46:50] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[06:46:50] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100%
[06:48:50] (CR) Slyngshede: [C:+2] data.yaml: add neslihanturan to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1152677 (https://phabricator.wikimedia.org/T394395) (owner: Slyngshede)
[06:50:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json
[06:51:52] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 70.43 ms
[06:51:52] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 74.13 ms
[06:54:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json
[06:57:22] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet
[07:00:05] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T0700). Please do the needful.
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:25] jmm@cumin1003 drain-node (PID 133660) is awaiting input
[07:01:02] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, Patch-For-Review: Requesting access to analytics-privatedata-users for Neslihan_Turan_WMDE - https://phabricator.wikimedia.org/T394395#10882658 (SLyngshede-WMF) Stalled→Resolved
[07:01:16] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10882659 (ops-monitoring-bot) Draining ganeti7001.magru.wmnet of running VMs
[07:01:47] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet
[07:01:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:02:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:05:25] (CR) Nikerabbit: [C:+1] mw::maintenance: don't run purge-old-cx-drafts against test2wiki [puppet] - https://gerrit.wikimedia.org/r/1153171 (https://phabricator.wikimedia.org/T395892) (owner: Hnowlan)
[07:05:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json
[07:08:33] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain
[07:08:59] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10882668 (ops-monitoring-bot) VM install7001.wikimedia.org switching disk type to plain
[07:09:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json
[07:09:42] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain
[07:13:33] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:13:36] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain
[07:16:29] SRE-SLO, observability, Data-Platform-SRE (2025.05.24 - 2025.06.13): WDQS Update Lag SLO looks wrong - https://phabricator.wikimedia.org/T395987 (Gehel) NEW
[07:16:33] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:16:38] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10882690 (ops-monitoring-bot) VM ncredir7001.magru.wmnet switching disk type to plain
[07:16:46] SRE-SLO, observability, Data-Platform-SRE (2025.05.24 - 2025.06.13): WDQS Update Lag SLO looks wrong - https://phabricator.wikimedia.org/T395987#10882691 (Gehel) p:Triage→High
[07:16:49] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain
[07:18:03] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of doh7001.wikimedia.org to plain
[07:19:04] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10882700 (ops-monitoring-bot) VM doh7001.wikimedia.org switching disk type to plain
[07:19:23] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain
[07:20:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json
[07:21:11] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:21:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:21:53] PROBLEM - Bird Internet Routing Daemon on doh7001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[07:21:53] PROBLEM - Check if anycast-healthchecker and all configured threads are running on doh7001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[07:22:07] PROBLEM - snapshot of x3 in codfw on backupmon1001 is CRITICAL: Last snapshot for x3 at codfw (db2200) taken on 2025-06-04 07:00:30 is 334 GiB, but the previous one was 1351 GiB, a change of -75.3 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:22:30] SRE, SRE-Access-Requests: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10882702 (SLyngshede-WMF)
[07:22:53] RECOVERY - Bird Internet Routing Daemon on doh7001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[07:22:53] RECOVERY - Check if anycast-healthchecker and all configured threads are running on doh7001 is OK: OK: UP (pid=2324) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[07:23:11] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:23:36] !log restart swift-object-replicator ms-be2066
[07:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json
[07:27:57] SRE, SRE-Access-Requests: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10882712 (SLyngshede-WMF) @cmelo can you sign the L3 acknowledgement, see link in the description above and provide an SSH public key (Note that this must be different from any keys you may hav...
[07:28:29] SRE, SRE-Access-Requests: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10882713 (SLyngshede-WMF) p:Triage→Medium a:SLyngshede-WMF
[07:28:35] RESOLVED: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[07:31:44] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain
[07:32:31] PROBLEM - Disk space on an-worker1109 is CRITICAL: DISK CRITICAL - free space: / 2000 MB (3% inode=95%): /tmp 2000 MB (3% inode=95%): /var/tmp 2000 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1109&var-datasource=eqiad+prometheus/ops
[07:33:54] SRE, SRE-Access-Requests, Data-Engineering, LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10882723 (SLyngshede-WMF) We need an approval from: @Ottomata @Ahoelzl or @Milimetric for access to analytics-privatedata-...
[07:34:27] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10882724 (10ops-monitoring-bot) VM durum7001.magru.wmnet switching disk type to plain [07:34:50] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain [07:35:17] (03Abandoned) 10Effie Mouzeli: push-notifications: change version tag to -production [deployment-charts] - 10https://gerrit.wikimedia.org/r/628340 (https://phabricator.wikimedia.org/T256973) (owner: 10MSantos) [07:35:18] (03Abandoned) 10Effie Mouzeli: conftool: Create a shared jobrunner_videoscaler [puppet] - 10https://gerrit.wikimedia.org/r/679258 (https://phabricator.wikimedia.org/T279100) (owner: 10Alexandros Kosiaris) [07:35:18] (03Abandoned) 10Effie Mouzeli: Add cookbook to build and upload Scap releases to apt.wm.o [cookbooks] - 10https://gerrit.wikimedia.org/r/727605 (owner: 10Legoktm) [07:35:18] (03Abandoned) 10Effie Mouzeli: safe-service-restart: Only verify in scope services [puppet] - 10https://gerrit.wikimedia.org/r/682619 (https://phabricator.wikimedia.org/T279100) (owner: 10Alexandros Kosiaris) [07:35:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json [07:35:56] (03Abandoned) 10Effie Mouzeli: systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) (owner: 10Effie Mouzeli) [07:36:18] (03Abandoned) 10Effie Mouzeli: mcrouter: update comments in mcrouter image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692371 (owner: 10Effie Mouzeli) [07:36:26] (03Abandoned) 10Effie Mouzeli: [DNM] mcrouter: Add priorityClassName option to the daemonset [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1009458 (owner: 10Effie Mouzeli) [07:37:07] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain [07:37:11] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:37:34] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10882740 (10ops-monitoring-bot) VM netflow7001.magru.wmnet switching disk type to plain [07:37:34] 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): WDQS Update Lag SLO looks wrong - https://phabricator.wikimedia.org/T395987#10882741 (10Gehel) [07:37:44] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain [07:37:53] PROBLEM - Bird Internet Routing Daemon on durum7001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:37:53] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum7001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [07:38:13] (03CR) 10JMeybohm: [C:03+2] k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [07:38:53] RECOVERY - Bird Internet Routing Daemon on durum7001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:38:53] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum7001 is OK: OK: UP (pid=2367) and all threads (8) are running 
https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [07:39:11] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:39:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json [07:39:49] (03CR) 10Effie Mouzeli: [C:03+2] "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054367 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [07:41:24] (03PS12) 10Effie Mouzeli: hieradata: Make wikikube-worker2100 a mw-experimental worker [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) [07:41:27] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [07:44:44] (03Merged) 10jenkins-bot: k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [07:48:35] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [07:48:42] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:48:45] (03PS1) 10Effie Mouzeli: profile::kubernetes::mediawiki_runner: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1153543 [07:48:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T395241)', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json 
[07:49:28] (03PS2) 10Effie Mouzeli: profile::kubernetes::mediawiki_runner: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1153543 [07:49:46] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1153543 (owner: 10Effie Mouzeli) [07:50:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json [07:52:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [07:53:48] (03CR) 10Effie Mouzeli: [C:03+2] profile::kubernetes::mediawiki_runner: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1153543 (owner: 10Effie Mouzeli) [07:55:03] (03PS13) 10Effie Mouzeli: hieradata: Make wikikube-worker2100 a mw-experimental worker [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) [07:55:07] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [07:56:23] (03CR) 10Nikerabbit: "Leaving for someone else to deploy." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1153292 (owner: 10PipelineBot) [07:57:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T395241)', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json [08:02:36] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org [08:03:38] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org [08:05:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json [08:05:57] !log installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) [08:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:36] (03CR) 10Ayounsi: [C:03+2] gNMI: add target down alerting [alerts] - 10https://gerrit.wikimedia.org/r/1153030 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [08:06:50] (03CR) 10Ayounsi: [C:03+2] Add alerting for gNMIc Go routines [alerts] - 10https://gerrit.wikimedia.org/r/1153091 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [08:07:28] fyi, 3 gNMIc alerts should show up in the next 30min, it's expected [08:08:15] (03Merged) 10jenkins-bot: gNMI: add target down alerting [alerts] - 10https://gerrit.wikimedia.org/r/1153030 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [08:08:44] (03Merged) 10jenkins-bot: Add alerting for gNMIc Go routines [alerts] - 10https://gerrit.wikimedia.org/r/1153091 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [08:10:48] (03PS1) 10Marostegui: db2224: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1153549 (https://phabricator.wikimedia.org/T395989) [08:10:59] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2224 T395989', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json [08:11:01] T395989: Migrate s6 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395989 [08:11:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance [08:12:17] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10882796 (10MoritzMuehlenhoff) [08:12:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json [08:12:58] (03CR) 10Marostegui: [C:03+2] db2224: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1153549 (https://phabricator.wikimedia.org/T395989) (owner: 10Marostegui) [08:13:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10882799 (10Stevemunene) Thanks @Jclark-ctr Moving to the next steps [08:14:51] !log removing atlas7001 from magru01 cluster T394263 [08:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:54] T394263: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263 [08:15:45] (03CR) 10Jaime Nuche: "Thank you so much for following up on this" [puppet] - 10https://gerrit.wikimedia.org/r/1137361 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [08:15:50] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be2066 - https://phabricator.wikimedia.org/T395990 (10MatthewVernon) 03NEW [08:16:13] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on 
ms-be2066 - https://phabricator.wikimedia.org/T395990#10882812 (10MatthewVernon) p:05Triage→03High This is blocking other work on ms-codfw at the moment. [08:17:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json [08:18:30] (03CR) 10Cathal Mooney: hiera: Replace lvs1017 with lvs1016 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [08:20:22] (03CR) 10Cathal Mooney: "Sure let me add a comment. The reason I changed it is I realised it's inaccurate to view it just as the server addressing (vlans or serve" [homer/public] - 10https://gerrit.wikimedia.org/r/1153274 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [08:20:45] FIRING: pageTest: testPage #page - https://alerts.wikimedia.org/?q=alertname%3DpageTest [08:20:56] wut? [08:21:02] vgutierrez: it's a test [08:21:22] it worked :D [08:21:24] (03PS2) 10Cathal Mooney: IBGP_OUT policy: rename last term and also export statics [homer/public] - 10https://gerrit.wikimedia.org/r/1153274 (https://phabricator.wikimedia.org/T394530) [08:21:27] nope [08:21:48] the test was not for IRC... [08:21:56] see -private for discussion [08:23:22] FIRING: [2x] GnmiTargetDown: cloudsw2-d5-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [08:23:23] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [08:24:15] (03PS1) 10Marostegui: s6 codfw: Migrate to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1153550 (https://phabricator.wikimedia.org/T383795) [08:25:28] (03CR) 10Cathal Mooney: [C:03+2] IBGP_OUT policy: rename last term and also export statics [homer/public] - 10https://gerrit.wikimedia.org/r/1153274 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [08:25:56] (03CR) 10Marostegui: [C:03+2] s6 codfw: Migrate to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1153550 (https://phabricator.wikimedia.org/T383795) (owner: 10Marostegui) [08:25:59] (03Merged) 10jenkins-bot: IBGP_OUT policy: rename last term and also export statics [homer/public] - 10https://gerrit.wikimedia.org/r/1153274 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [08:26:43] (03PS1) 10Alexandros Kosiaris: tlsproxy: Specify a default retry_on policy [puppet] - 10https://gerrit.wikimedia.org/r/1153551 (https://phabricator.wikimedia.org/T380958) [08:27:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json [08:27:56] 06SRE, 06Infrastructure-Foundations: Request additional access for Dcops group - https://phabricator.wikimedia.org/T395939#10882832 (10MoritzMuehlenhoff) Looks good, we already have megacli and hpssacli in the existing rules. If while we're at it, let's also add storcli? 
[08:28:21] !log Change s6 codfw dbmaint to SBR T383795 [08:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:23] T383795: Move sX to STATEMENT based replication - https://phabricator.wikimedia.org/T383795 [08:28:51] (03CR) 10CI reject: [V:04-1] tlsproxy: Specify a default retry_on policy [puppet] - 10https://gerrit.wikimedia.org/r/1153551 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [08:29:10] (03CR) 10Vgutierrez: [C:03+1] "thx for taking care of this <3" [puppet] - 10https://gerrit.wikimedia.org/r/1153327 (owner: 10BCornwall) [08:31:51] (03PS2) 10Alexandros Kosiaris: tlsproxy: Specify a default retry_on policy [puppet] - 10https://gerrit.wikimedia.org/r/1153551 (https://phabricator.wikimedia.org/T380958) [08:32:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json [08:33:04] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10882846 (10MoritzMuehlenhoff) [08:37:07] (03CR) 10Ayounsi: [C:03+2] gNMI: bump number of workers to 32 [puppet] - 10https://gerrit.wikimedia.org/r/1153096 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [08:38:04] !log revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 [08:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:05] !log Change s6 eqiad dbmaint to SBR T383795 [08:38:07] !log removing ganeti7001 from magru01 cluster T394263 [08:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:08] T383795: Move sX to STATEMENT based replication - https://phabricator.wikimedia.org/T383795 [08:38:08] (03PS1) 10Marostegui: s6 eqiad: Migrate to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1153552 
(https://phabricator.wikimedia.org/T383795) [08:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:10] T394263: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263 [08:38:39] (03CR) 10Marostegui: [C:03+2] s6 eqiad: Migrate to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1153552 (https://phabricator.wikimedia.org/T383795) (owner: 10Marostegui) [08:39:07] (03CR) 10Alexandros Kosiaris: [C:03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151698 (https://phabricator.wikimedia.org/T395451) (owner: 10Alexandros Kosiaris) [08:40:48] (03Merged) 10jenkins-bot: jobqueue: Set the host header in all jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151698 (https://phabricator.wikimedia.org/T395451) (owner: 10Alexandros Kosiaris) [08:40:55] PROBLEM - ganeti-confd running on ganeti7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [08:40:55] PROBLEM - ganeti-noded running on ganeti7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [08:40:59] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1153551 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [08:41:19] (03PS1) 10Muehlenhoff: Reimage ganeti7001 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1153554 (https://phabricator.wikimedia.org/T394263) [08:41:27] ^ ganeti7001 is expected [08:41:59] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [08:42:05] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [08:42:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T395241)', diff saved to https://phabricator.wikimedia.org/P76995 and previous config 
saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json [08:42:49] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [08:42:51] !log deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. T395451 [08:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:54] T395451: Make the JobQueue compatible with the MediaWiki Single version HTTP routing system - https://phabricator.wikimedia.org/T395451 [08:43:07] (03CR) 10Muehlenhoff: [C:03+2] Reimage ganeti7001 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1153554 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [08:43:22] RESOLVED: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [08:43:35] FIRING: ProbeDown: Service ganeti7001:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:44:31] FIRING: Emergency syslog message: Alert for device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [08:47:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json [08:48:15] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [08:48:20] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T395241)', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json [08:49:30] RESOLVED: Emergency syslog message: Device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [08:49:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10882899 (10Stevemunene) a:03Stevemunene [08:50:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10882902 (10Stevemunene) a:03Stevemunene [08:51:27] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm [08:51:37] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10882905 (10Stevemunene) [08:51:42] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10882906 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host ganeti7001.magru.wmnet with OS bookworm [08:51:58] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153140 (owner: 10Muehlenhoff) [08:52:31] FIRING: Emergency syslog message: Alert for device lsw1-c6-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [08:52:49] (03CR) 10Fabfur: [C:03+1] "that's fine to me" [puppet] - 10https://gerrit.wikimedia.org/r/1152083 (owner: 
10Giuseppe Lavagetto) [08:53:12] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [08:53:22] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [08:53:32] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [08:55:21] (03PS1) 10Jon Harald Søby: Remove white outline from Wikifunctions favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153557 (https://phabricator.wikimedia.org/T326094) [08:56:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T395241)', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json [08:57:11] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [08:57:18] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [09:00:51] (03CR) 10Alexandros Kosiaris: [C:03+1] admin_ng: bump limits for eventrouter in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153320 (owner: 10Hnowlan) [09:01:32] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [09:02:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance [09:02:27] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [09:02:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json [09:02:31] RESOLVED: Emergency syslog message: Device lsw1-c6-codfw.mgmt.codfw.wmnet 
recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [09:02:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json [09:02:51] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [09:03:44] (03CR) 10Hnowlan: [C:03+2] admin_ng: bump limits for eventrouter in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153320 (owner: 10Hnowlan) [09:03:59] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [09:05:44] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [09:05:58] (03PS1) 10Stevemunene: hdfs: Exclude group 7 and 8 hosts from analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1153560 (https://phabricator.wikimedia.org/T390174) [09:06:12] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-f1-codfw [09:07:30] FIRING: Emergency syslog message: Alert for device asw1-b3-magru.mgmt.magru.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [09:08:06] (03CR) 10Jon Harald Søby: "@jforrester@wikimedia.org If you're okay with this change, I can take it from there and schedule it for the next available deployment wind" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153557 (https://phabricator.wikimedia.org/T326094) (owner: 10Jon Harald Søby) [09:08:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json [09:08:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw [09:09:08] jouncebot: nowandnext 
[09:09:08] No deployments scheduled for the next 0 hour(s) and 50 minute(s) [09:09:08] In 0 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1000) [09:10:16] !log installing qemu bugfix updates from Bookworm point release [09:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:19] (03Merged) 10jenkins-bot: admin_ng: bump limits for eventrouter in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153320 (owner: 10Hnowlan) [09:10:23] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [09:10:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json [09:11:03] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance [09:11:20] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [09:11:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json [09:12:30] RESOLVED: Emergency syslog message: Device asw1-b3-magru.mgmt.magru.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [09:12:45] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage [09:13:22] RESOLVED: GnmiTargetDown: lsw1-f1-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [09:14:03] !log 
akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [09:14:19] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [09:14:41] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [09:14:54] !log T395451 rollback the host header addition, this is erroring out, returning 3xx. [09:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:57] T395451: Make the JobQueue compatible with the MediaWiki Single version HTTP routing system - https://phabricator.wikimedia.org/T395451 [09:15:11] !log T395451 rollback the host header addition, this is erroring out, returning 404s. [09:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:21] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [09:15:49] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [09:16:07] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [09:16:21] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage [09:16:23] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[09:17:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json [09:17:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json [09:19:04] 06SRE, 10ChangeProp, 06cloud-services-team, 06collaboration-services, and 10 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#10882980 (10hashar) 05Open→03Resolved a:03hashar After chatting with Alexandros, the relicensing... [09:19:19] (03PS1) 10Majavah: maintain-dbusers: Fix querying current user grants [puppet] - 10https://gerrit.wikimedia.org/r/1153562 [09:19:19] (03PS1) 10Majavah: maintain-dbusers: harvest: Do not create PAWS account on ToolsDB [puppet] - 10https://gerrit.wikimedia.org/r/1153563 [09:19:20] 10ops-eqiad, 06SRE, 06DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#10882986 (10Peachey88) [09:19:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc1 T395983', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json [09:19:27] T395983: Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata - https://phabricator.wikimedia.org/T395983 [09:19:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance [09:20:05] !log Move datadir on pc2011 dbmaint pc1 codfw T395983 [09:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:13] (03PS2) 10Stevemunene: hdfs: Exclude group 7 and 8 hosts from analytics cluster [puppet] - 
10https://gerrit.wikimedia.org/r/1153560 (https://phabricator.wikimedia.org/T390174) [09:21:19] (03PS2) 10Majavah: maintain-dbusers: harvest: Do not create PAWS account on ToolsDB [puppet] - 10https://gerrit.wikimedia.org/r/1153563 [09:22:12] (03CR) 10CI reject: [V:04-1] maintain-dbusers: harvest: Do not create PAWS account on ToolsDB [puppet] - 10https://gerrit.wikimedia.org/r/1153563 (owner: 10Majavah) [09:22:48] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1152288 (owner: 10Giuseppe Lavagetto) [09:23:12] (03PS3) 10Majavah: maintain-dbusers: harvest: Do not create PAWS account on ToolsDB [puppet] - 10https://gerrit.wikimedia.org/r/1153563 [09:23:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json [09:24:07] !log taavi@cumin1002 conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 [09:24:19] (03PS1) 10Majavah: Reapply "hieradata: cloudlb: Move x3 VIP to new x3 backend" [puppet] - 10https://gerrit.wikimedia.org/r/1153564 (https://phabricator.wikimedia.org/T390954) [09:25:27] (03CR) 10CI reject: [V:04-1] maintain-dbusers: harvest: Do not create PAWS account on ToolsDB [puppet] - 10https://gerrit.wikimedia.org/r/1153563 (owner: 10Majavah) [09:26:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json [09:27:08] !log Move datadir on pc1011 dbmaint pc1 eqiad T395983 [09:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:10] T395983: Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata - https://phabricator.wikimedia.org/T395983 [09:27:40] (03CR) 10Alexandros Kosiaris: [C:03+2] "The PCC failure is 
just for puppet 5 (of which we got like a couple of hosts only) and it's a complete failure. I 've tested the resulting" [puppet] - 10https://gerrit.wikimedia.org/r/1153551 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [09:28:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc1 T395983', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json [09:29:06] (03PS2) 10Majavah: Reapply "hieradata: cloudlb: Move x3 VIP to new x3 backend" [puppet] - 10https://gerrit.wikimedia.org/r/1153564 (https://phabricator.wikimedia.org/T390954) [09:29:06] (03PS1) 10Majavah: maintain-dbusers: Fix dict structure for max connection overrides [puppet] - 10https://gerrit.wikimedia.org/r/1153565 [09:30:08] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1150626 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [09:30:19] (03CR) 10Fabfur: [C:03+1] "lgtm and godspeed!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1150587 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [09:31:24] (03CR) 10CI reject: [V:04-1] maintain-dbusers: Fix dict structure for max connection overrides [puppet] - 10https://gerrit.wikimedia.org/r/1153565 (owner: 10Majavah) [09:32:07] (03PS2) 10Majavah: maintain-dbusers: Fix dict structure for max connection overrides [puppet] - 10https://gerrit.wikimedia.org/r/1153565 [09:32:08] (03PS3) 10Majavah: Reapply "hieradata: cloudlb: Move x3 VIP to new x3 backend" [puppet] - 10https://gerrit.wikimedia.org/r/1153564 (https://phabricator.wikimedia.org/T390954) [09:32:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json [09:32:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json [09:33:55] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet [09:34:52] PROBLEM - snapshot of x3 in eqiad on backupmon1001 is CRITICAL: Last snapshot for x3 at eqiad (db1216) taken on 2025-06-04 09:07:14 is 273 GiB, but the previous one was 1373 GiB, a change of -80.1 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:36:02] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet [09:36:15] (03PS1) 10Marostegui: installserver: Allow reimage of db1246 [puppet] - 10https://gerrit.wikimedia.org/r/1153566 (https://phabricator.wikimedia.org/T393296) [09:37:21] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:37:33] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[09:38:31] (03CR) 10Marostegui: [C:03+2] installserver: Allow reimage of db1246 [puppet] - 10https://gerrit.wikimedia.org/r/1153566 (https://phabricator.wikimedia.org/T393296) (owner: 10Marostegui) [09:38:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json [09:40:03] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm [09:40:14] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10883031 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host ganeti7001.magru.wmnet with OS bookworm completed: - ganeti7... [09:41:05] (03CR) 10Ayounsi: [C:03+1] "one small comment otherwise lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/1153172 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [09:41:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T395241)', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json [09:42:06] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet [09:42:10] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [09:42:11] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet [09:42:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T395241)', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json [09:43:35] 
RESOLVED: ProbeDown: Service ganeti4005:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:44:12] (03PS1) 10Muehlenhoff: Assign ganeti_routed role to ganeti7001 and add it to the cluster node list [puppet] - 10https://gerrit.wikimedia.org/r/1153568 (https://phabricator.wikimedia.org/T394263) [09:44:26] (03CR) 10CI reject: [V:04-1] Assign ganeti_routed role to ganeti7001 and add it to the cluster node list [puppet] - 10https://gerrit.wikimedia.org/r/1153568 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:45:14] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: don't run purge-old-cx-drafts against test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1153171 (https://phabricator.wikimedia.org/T395892) (owner: 10Hnowlan) [09:46:02] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [09:46:52] !log T395451 deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. 
[09:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:55] T395451: Make the JobQueue compatible with the MediaWiki Single version HTTP routing system - https://phabricator.wikimedia.org/T395451 [09:46:59] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [09:47:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json [09:50:30] FIRING: [3x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota Has been acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [09:50:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T395241)', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json [09:50:49] you can ignore ^ it alerts on the ACK in librenms [09:51:29] (03CR) 10FNegri: [C:03+1] maintain-dbusers: Fix dict structure for max connection overrides (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1153565 (owner: 10Majavah) [09:51:41] (03CR) 10Majavah: [C:03+2] maintain-dbusers: Fix dict structure for max connection overrides [puppet] - 10https://gerrit.wikimedia.org/r/1153565 (owner: 10Majavah) [09:51:54] (03PS2) 10Muehlenhoff: Assign ganeti_routed role to ganeti7001 and add it to the cluster node list [puppet] - 10https://gerrit.wikimedia.org/r/1153568 (https://phabricator.wikimedia.org/T394263) [09:52:02] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [09:52:14] !log re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
T395451 [09:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:18] T395451: Make the JobQueue compatible with the MediaWiki Single version HTTP routing system - https://phabricator.wikimedia.org/T395451 [09:52:50] (03PS1) 10Effie Mouzeli: kubernetes:mediawiki_runner: include mediawiki::system_users [puppet] - 10https://gerrit.wikimedia.org/r/1153570 [09:52:52] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [09:53:19] (03PS14) 10Effie Mouzeli: hieradata: Make wikikube-worker2100 a mw-experimental worker [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) [09:53:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json [09:53:50] (03CR) 10Effie Mouzeli: hieradata: Make wikikube-worker2100 a mw-experimental worker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [09:54:10] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [09:55:16] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [09:55:48] (03PS4) 10Majavah: Reapply "hieradata: cloudlb: Move x3 VIP to new x3 backend" [puppet] - 10https://gerrit.wikimedia.org/r/1153564 (https://phabricator.wikimedia.org/T390954) [09:55:48] (03PS1) 10Majavah: maintain-dbusers: Revert overly strict type [puppet] - 10https://gerrit.wikimedia.org/r/1153571 [09:55:51] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [09:56:38] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1153562 (owner: 10Majavah) [09:56:47] (03CR) 10Majavah: [C:03+2] 
maintain-dbusers: Fix querying current user grants [puppet] - 10https://gerrit.wikimedia.org/r/1153562 (owner: 10Majavah) [09:59:22] (03PS2) 10Effie Mouzeli: kubernetes:mediawiki_runner: include mediawiki::system_users [puppet] - 10https://gerrit.wikimedia.org/r/1153570 [09:59:26] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1153570 (owner: 10Effie Mouzeli) [09:59:37] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:59:41] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1153570 (owner: 10Effie Mouzeli) [09:59:45] (03CR) 10FNegri: [C:03+1] maintain-dbusers: Revert overly strict type [puppet] - 10https://gerrit.wikimedia.org/r/1153571 (owner: 10Majavah) [09:59:48] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1000) [10:00:07] !log depool lvs1013 before switching to katran - T395228 [10:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:12] T395228: Test katran forwarding plane on lvs1013 - https://phabricator.wikimedia.org/T395228 [10:00:13] (03CR) 10Vgutierrez: [C:03+2] hiera: Depool lvs1013 before switching to katran [puppet] - 10https://gerrit.wikimedia.org/r/1150587 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [10:02:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json [10:03:01] (03CR) 10Ayounsi: [C:03+1] Assign ganeti_routed role to ganeti7001 and add it to the cluster node list [puppet] - 10https://gerrit.wikimedia.org/r/1153568 (https://phabricator.wikimedia.org/T394263) (owner: 
10Muehlenhoff) [10:03:37] (03CR) 10Fabfur: [C:03+1] hiera: Unify edge uniques settings [puppet] - 10https://gerrit.wikimedia.org/r/1151711 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [10:04:09] !log upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - T395228 [10:04:10] (03PS2) 10Gergő Tisza: logging: Allow sampling of Logstash logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153363 (https://phabricator.wikimedia.org/T395967) [10:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:37] (03CR) 10Gergő Tisza: "Not yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153364 (https://phabricator.wikimedia.org/T394402) (owner: 10Gergő Tisza) [10:05:25] (03PS2) 10Gergő Tisza: logging: Sample some high-volume log streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153364 (https://phabricator.wikimedia.org/T394402) [10:05:33] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran [10:05:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json [10:06:13] 06SRE, 06Infrastructure-Foundations, 10netops: Export additional network device stats in gnmi - https://phabricator.wikimedia.org/T395998 (10cmooney) 03NEW p:05Triage→03Low [10:06:59] 06SRE, 06Infrastructure-Foundations, 10netops: Export additional network device stats in gnmi - https://phabricator.wikimedia.org/T395998#10883105 (10cmooney) [10:08:36] (03CR) 10Ladsgroup: "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1152673 (owner: 10Marostegui) [10:08:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json [10:09:08] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet [10:09:56] (03CR) 10Muehlenhoff: [C:03+2] Assign ganeti_routed role to ganeti7001 and add it to the cluster node list [puppet] - 10https://gerrit.wikimedia.org/r/1153568 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [10:10:30] RESOLVED: [3x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota Has been acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [10:11:09] 06SRE, 06Infrastructure-Foundations, 10netops: Homer: stop using the 'section' macro in jinja templates - https://phabricator.wikimedia.org/T395555#10883127 (10cmooney) Ok thanks guys. Let me see if I can prep a patch to remove it where we currently are. It would clear up my proposed IBGP patch quite a bit... 
[10:12:08] (03CR) 10Clément Goubert: [C:03+1] kubernetes:mediawiki_runner: include mediawiki::system_users [puppet] - 10https://gerrit.wikimedia.org/r/1153570 (owner: 10Effie Mouzeli) [10:12:30] (03CR) 10Vgutierrez: [C:03+2] hiera: Use katran in lvs1013 [puppet] - 10https://gerrit.wikimedia.org/r/1150626 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [10:12:38] (03PS3) 10Vgutierrez: hiera: Use katran in lvs1013 [puppet] - 10https://gerrit.wikimedia.org/r/1150626 (https://phabricator.wikimedia.org/T395228) [10:13:38] jmm@cumin1003 drain-node (PID 153007) is awaiting input [10:13:58] (03PS1) 10Ladsgroup: mariadb: Comment out future sections [puppet] - 10https://gerrit.wikimedia.org/r/1153573 (https://phabricator.wikimedia.org/T395999) [10:14:50] (03CR) 10Vgutierrez: [C:03+2] hiera: Use katran in lvs1013 [puppet] - 10https://gerrit.wikimedia.org/r/1150626 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [10:15:35] (03CR) 10Majavah: [C:03+2] maintain-dbusers: Revert overly strict type [puppet] - 10https://gerrit.wikimedia.org/r/1153571 (owner: 10Majavah) [10:15:42] FIRING: JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:16:10] (03PS3) 10Effie Mouzeli: kubernetes:mediawiki_runner: include mediawiki::system_users [puppet] - 10https://gerrit.wikimedia.org/r/1153570 [10:16:15] ^^ that's me... 
apparently downtiming lvs1013 wasn't enough [10:16:28] (03CR) 10Effie Mouzeli: [C:03+2] kubernetes:mediawiki_runner: include mediawiki::system_users [puppet] - 10https://gerrit.wikimedia.org/r/1153570 (owner: 10Effie Mouzeli) [10:17:22] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet [10:17:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json [10:19:57] (03PS2) 10Ladsgroup: mariadb: Comment out future sections [puppet] - 10https://gerrit.wikimedia.org/r/1153573 (https://phabricator.wikimedia.org/T395999) [10:20:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json [10:23:23] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet [10:23:28] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet [10:23:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json [10:23:56] (03CR) 10Clément Goubert: [C:03+1] hieradata: Make wikikube-worker2100 a mw-experimental worker [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [10:24:43] (03CR) 10Effie Mouzeli: [C:03+2] hieradata: Make wikikube-worker2100 a mw-experimental worker [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [10:25:28] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for 
host ganeti7001.magru.wmnet [10:25:31] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet [10:25:54] (03CR) 10FNegri: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1153573 (https://phabricator.wikimedia.org/T395999) (owner: 10Ladsgroup) [10:28:27] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Should be okay to deploy, no special database tables are needed on the wiki due to `wmgCampaignEventsUseCentralDB` being `true`." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146628 (https://phabricator.wikimedia.org/T393604) (owner: 10Mhorsey) [10:29:15] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet [10:32:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json [10:35:19] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet [10:35:20] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet [10:35:24] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet [10:36:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T395241)', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json [10:36:23] !log failover ganeti master in ulsfo to ganeti4005 [10:36:23] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [10:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T395241)', diff saved to 
https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json [10:37:44] !log jmm@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B [10:37:50] (03CR) 10Majavah: [C:03+2] Reapply "hieradata: cloudlb: Move x3 VIP to new x3 backend" [puppet] - 10https://gerrit.wikimedia.org/r/1153564 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [10:38:08] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B [10:38:38] PROBLEM - ganeti-wconfd running on ganeti4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [10:44:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T395241)', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json [10:44:51] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [10:45:22] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [10:48:25] (03PS1) 10Alexandros Kosiaris: mw-jobrunner: VirtualHost priority to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153578 (https://phabricator.wikimedia.org/T395451) [10:50:10] ACKNOWLEDGEMENT - snapshot of s8 in codfw on backupmon1001 is CRITICAL: Last snapshot for s8 at codfw (db2198) taken on 2025-06-04 01:38:26 is 1088 GiB, but the previous one was 1351 GiB, a change of -19.5 % Jcrespo expected x3 setup https://phabricator.wikimedia.org/T384274 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:50:10] ACKNOWLEDGEMENT - snapshot of s8 in eqiad on backupmon1001 is CRITICAL: Last snapshot for s8 at eqiad (db1171) taken on 2025-06-04 01:52:23 is 1216 
GiB, but the previous one was 1587 GiB, a change of -23.3 % Jcrespo expected x3 setup https://phabricator.wikimedia.org/T384274 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:50:10] ACKNOWLEDGEMENT - snapshot of x3 in codfw on backupmon1001 is CRITICAL: Last snapshot for x3 at codfw (db2200) taken on 2025-06-04 07:00:30 is 334 GiB, but the previous one was 1351 GiB, a change of -75.3 % Jcrespo expected x3 setup https://phabricator.wikimedia.org/T384274 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:50:10] ACKNOWLEDGEMENT - snapshot of x3 in eqiad on backupmon1001 is CRITICAL: Last snapshot for x3 at eqiad (db1216) taken on 2025-06-04 09:07:14 is 273 GiB, but the previous one was 1373 GiB, a change of -80.1 % Jcrespo expected x3 setup https://phabricator.wikimedia.org/T384274 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:51:41] !log jmm@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B [10:52:02] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B [10:53:23] (03CR) 10Clément Goubert: [C:03+2] k8s-controller-sidecar: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153305 (owner: 10Clément Goubert) [10:53:32] jouncebot: nowandnext [10:53:32] For the next 0 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1000) [10:53:32] In 0 hour(s) and 6 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1100) [10:54:59] (03PS1) 10FNegri: wikireplicas: centralize max_connections values [puppet] - 10https://gerrit.wikimedia.org/r/1153579 [10:57:29] (03PS1) 10Samtar: IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1153581 (https://phabricator.wikimedia.org/T377975) [10:58:19] (03PS1) 10Muehlenhoff: Remove ganeti7001 Hiera config for old ganeti01 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1153582 (https://phabricator.wikimedia.org/T394263) [10:58:59] (03CR) 10Marostegui: "Check https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135385 that broke things, so please double check what's needed there." [puppet] - 10https://gerrit.wikimedia.org/r/1153573 (https://phabricator.wikimedia.org/T395999) (owner: 10Ladsgroup) [10:59:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json [10:59:58] (03Merged) 10jenkins-bot: k8s-controller-sidecar: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153305 (owner: 10Clément Goubert) [11:00:05] mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1100). [11:02:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:03:54] !log cgoubert@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:04:21] !log cgoubert@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:04:41] !log cgoubert@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[11:04:56] (03CR) 10Samwilson: [C:03+1] IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153581 (https://phabricator.wikimedia.org/T377975) (owner: 10Samtar) [11:05:07] !log cgoubert@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:05:20] (03CR) 10Clément Goubert: [C:03+1] mw-jobrunner: VirtualHost priority to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153578 (https://phabricator.wikimedia.org/T395451) (owner: 10Alexandros Kosiaris) [11:05:36] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [11:06:00] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:06:14] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:06:25] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:07:08] (03CR) 10Alexandros Kosiaris: [C:03+2] mw-jobrunner: VirtualHost priority to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153578 (https://phabricator.wikimedia.org/T395451) (owner: 10Alexandros Kosiaris) [11:07:14] (03CR) 10Alexandros Kosiaris: [C:03+2] "thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1153578 (https://phabricator.wikimedia.org/T395451) (owner: 10Alexandros Kosiaris) [11:08:32] (03Merged) 10jenkins-bot: mw-jobrunner: VirtualHost priority to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153578 (https://phabricator.wikimedia.org/T395451) (owner: 10Alexandros Kosiaris) [11:09:42] (03PS1) 10Marostegui: db2193: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1153584 (https://phabricator.wikimedia.org/T395989) [11:09:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2193 T395989', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json [11:09:59] T395989: Migrate s6 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395989 [11:10:18] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance [11:10:59] (03CR) 10Marostegui: [C:03+2] db2193: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1153584 (https://phabricator.wikimedia.org/T395989) (owner: 10Marostegui) [11:14:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json [11:14:25] FIRING: SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:14:42] !log Deployed k8s-controller-sidecars version 1.0.2-3 [11:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and 
previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json [11:18:03] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153109 (owner: 10PipelineBot) [11:18:35] FIRING: ProbeDown: Service ganeti7001:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:19:32] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153109 (owner: 10PipelineBot) [11:20:08] (03PS1) 10Majavah: toolforge: wmcs-package-build: Fix Aptly host name [puppet] - 10https://gerrit.wikimedia.org/r/1153586 [11:20:08] (03PS1) 10Majavah: toolforge: wmcs-package-build: Remove unneeded escape [puppet] - 10https://gerrit.wikimedia.org/r/1153587 (https://phabricator.wikimedia.org/T396004) [11:20:41] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet [11:21:49] PROBLEM - Hadoop NodeManager on an-worker1141 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:22:50] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:22:54] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:24:31] (03CR) 10Marostegui: "I don't have much visibility on how this process is done outside of mariadb, so if you think it is fine, then go for it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1153579 (owner: 10FNegri) [11:25:14] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:25:22] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:26:05] jmm@cumin1003 drain-node (PID 159696) is awaiting input [11:27:42] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet [11:28:19] PROBLEM - Hadoop NodeManager on an-worker1206 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:29:03] PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:29:25] RESOLVED: SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:29:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json [11:29:34] 06SRE, 06Infrastructure-Foundations: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10883351 (10MoritzMuehlenhoff) [11:30:04] (03PS1) 10Majavah: P:toolforge: aptly: Install rsync for backups [puppet] - 10https://gerrit.wikimedia.org/r/1153588 [11:30:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T395241)', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to 
/var/cache/conftool/dbconfig/20250604-113005-fceratto.json [11:30:23] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [11:30:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T395241)', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json [11:30:51] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10883354 (10Jclark-ctr) Racking locations for ms-be109[2–5]. I could possibly fit: 1 in A2 1 in A7 1 in E8 1 in F8 [11:31:01] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:31:03] RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:31:18] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:32:38] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet [11:32:59] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:33:29] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:34:02] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet [11:34:21] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet [11:34:40] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[11:34:43] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:35:17] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:35:38] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [11:35:44] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:36:54] (03CR) 10Ayounsi: [C:03+1] Remove ganeti7001 Hiera config for old ganeti01 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1153582 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [11:37:47] jouncebot: nowandnext [11:37:48] For the next 0 hour(s) and 22 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1100) [11:37:48] In 1 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1300) [11:37:49] 10ops-eqiad, 06SRE, 06DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#10883367 (10Jclark-ctr) @VRiley-WMF noticed these devices have not yet been added to Netbox following their receipt. 
Can you please add them T382370 [11:38:32] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [11:38:47] 10ops-eqiad, 06SRE, 06DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#10883373 (10Jclark-ctr) a:03VRiley-WMF [11:38:49] RECOVERY - Hadoop NodeManager on an-worker1141 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:38:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T395241)', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json [11:38:53] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti7001 Hiera config for old ganeti01 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1153582 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [11:38:54] fyi going to deploy a config patch [11:39:10] ack [11:39:54] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10883376 (10cmooney) >>! In T393614#10881864, @Jhancock.wm wrote: > yeah same rack. I just need someone to migrate it for me since cloud is more complex tha... [11:40:25] (err actually one moment.. o.o) [11:42:48] ok phew :D [11:43:22] 06SRE, 06Infrastructure-Foundations, 10netops: Export additional network device stats in gnmi - https://phabricator.wikimedia.org/T395998#10883383 (10ayounsi) Good idea! in theory not particularly difficult, but we should look at reducing the load (go routines) on the current gNMIc instances first. 
[11:43:33] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:43:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153581 (https://phabricator.wikimedia.org/T377975) (owner: 10Samtar) [11:44:08] jmm@cumin1003 decommission (PID 161607) is awaiting input [11:44:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json [11:45:09] (03Merged) 10jenkins-bot: IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153581 (https://phabricator.wikimedia.org/T377975) (owner: 10Samtar) [11:45:32] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1153581|IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] [11:45:35] T377975: Enable template favouriting on Beta, pilot wikis, and test - https://phabricator.wikimedia.org/T377975 [11:46:33] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:47:36] !log samtar@deploy1003 samtar: Backport for [[gerrit:1153581|IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:47:38] * TheresNoTime testing [11:49:30] (03PS1) 10Reedy: GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring [extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153591 (https://phabricator.wikimedia.org/T388531) [11:49:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10883414 (10cmooney) >>! In T394333#10881532, @Andrew wrote: > Among other things, the connection speed for 1048 looks pretty wrong; we were hoping this wo... [11:49:43] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:50:35] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10883418 (10MatthewVernon) Just to record what we talked about on IRC - the plan is for now to put two nodes into A (one each in A2 and A7), and at the sa... 
[11:50:49] (03PS1) 10Reedy: captcha.py: Expand variables and user in filenames [extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153592 (https://phabricator.wikimedia.org/T395810) [11:51:00] !log samtar@deploy1003 samtar: Continuing with sync [11:51:08] (03PS1) 10Reedy: captcha.py: Check if output dir exists, and attempt to create it (else error) [extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153593 (https://phabricator.wikimedia.org/T395804) [11:51:21] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm [11:51:35] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10883424 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host ganeti7001.magru.wmnet with OS bookworm [11:51:56] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [11:52:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [11:52:51] (03PS1) 10Effie Mouzeli: mcrouter ds: allow mw-mcrouter ds to run on mw-experimental nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153594 (https://phabricator.wikimedia.org/T276994) [11:52:52] (03PS1) 10Reedy: captcha.py: Bail out if no words were read from wordlist [extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153595 (https://phabricator.wikimedia.org/T395809) [11:53:06] (03CR) 10Reedy: [C:03+2] GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring 
[extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153591 (https://phabricator.wikimedia.org/T388531) (owner: 10Reedy) [11:53:11] (03CR) 10Reedy: [C:03+2] captcha.py: Expand variables and user in filenames [extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153592 (https://phabricator.wikimedia.org/T395810) (owner: 10Reedy) [11:53:18] (03CR) 10Reedy: [C:03+2] captcha.py: Check if output dir exists, and attempt to create it (else error) [extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153593 (https://phabricator.wikimedia.org/T395804) (owner: 10Reedy) [11:53:23] (03CR) 10Reedy: [C:03+2] captcha.py: Bail out if no words were read from wordlist [extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153595 (https://phabricator.wikimedia.org/T395809) (owner: 10Reedy) [11:53:37] (03PS1) 10Cathal Mooney: Templates: replace 'section' macro with include statements [homer/public] - 10https://gerrit.wikimedia.org/r/1153596 (https://phabricator.wikimedia.org/T395555) [11:53:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json [11:53:59] (03PS2) 10Effie Mouzeli: mcrouter ds: allow mw-mcrouter ds to run on mw-experimental nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153594 (https://phabricator.wikimedia.org/T276994) [11:54:06] (03CR) 10CI reject: [V:04-1] Templates: replace 'section' macro with include statements [homer/public] - 10https://gerrit.wikimedia.org/r/1153596 (https://phabricator.wikimedia.org/T395555) (owner: 10Cathal Mooney) [11:55:02] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - 
jmm@cumin1003" [11:55:02] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:55:03] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet [11:55:19] RECOVERY - Hadoop NodeManager on an-worker1206 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:55:40] (03PS2) 10Cathal Mooney: Templates: replace 'section' macro with include statements [homer/public] - 10https://gerrit.wikimedia.org/r/1153596 (https://phabricator.wikimedia.org/T395555) [11:55:58] jouncebot: nowandnext [11:55:59] For the next 0 hour(s) and 4 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1100) [11:55:59] In 1 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1300) [11:56:37] Reedy: TheresNoTime is finishing up a backport [11:56:40] then we can go [11:56:49] (only a few minutes left) [11:58:00] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153581|IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) [11:58:06] T377975: Enable template favouriting on Beta, pilot wikis, and test - https://phabricator.wikimedia.org/T377975 [11:58:15] claime: done [11:58:29] TheresNoTime: ty [11:59:21] (03PS1) 10Muehlenhoff: Remove krb1001 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1153597 (https://phabricator.wikimedia.org/T396007) [11:59:32] Reedy: whenever you're ready [11:59:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json [12:00:00] claime: sometime after CI is [12:00:05] lol 
[12:00:40] (03PS3) 10Cathal Mooney: Templates: replace 'section' macro with include statements [homer/public] - 10https://gerrit.wikimedia.org/r/1153596 (https://phabricator.wikimedia.org/T395555) [12:02:20] (03CR) 10Muehlenhoff: [C:03+2] Remove krb1001 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1153597 (https://phabricator.wikimedia.org/T396007) (owner: 10Muehlenhoff) [12:04:18] (03PS3) 10Effie Mouzeli: mcrouter ds: allow mw-mcrouter ds to run on mw-experimental nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153594 (https://phabricator.wikimedia.org/T276994) [12:04:54] (03Merged) 10jenkins-bot: GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring [extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153591 (https://phabricator.wikimedia.org/T388531) (owner: 10Reedy) [12:04:55] (03Merged) 10jenkins-bot: captcha.py: Expand variables and user in filenames [extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153592 (https://phabricator.wikimedia.org/T395810) (owner: 10Reedy) [12:04:57] (03Merged) 10jenkins-bot: captcha.py: Check if output dir exists, and attempt to create it (else error) [extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153593 (https://phabricator.wikimedia.org/T395804) (owner: 10Reedy) [12:04:58] (03Merged) 10jenkins-bot: captcha.py: Bail out if no words were read from wordlist [extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153595 (https://phabricator.wikimedia.org/T395809) (owner: 10Reedy) [12:06:56] (03PS1) 10Reedy: GenerateFancyCaptchas: Don't try and delete captchas if the filename is empty [extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153598 (https://phabricator.wikimedia.org/T388531) [12:07:03] (03CR) 10Reedy: [C:03+2] GenerateFancyCaptchas: Don't try and delete captchas if the filename is empty [extensions/ConfirmEdit] 
(wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153598 (https://phabricator.wikimedia.org/T388531) (owner: 10Reedy) [12:09:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json [12:12:48] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage [12:13:20] (03PS1) 10Clément Goubert: mw-cron: Enable limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153602 (https://phabricator.wikimedia.org/T395436) [12:14:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json [12:16:30] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage [12:16:51] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153594 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [12:17:05] jouncebot: now [12:17:05] No deployments scheduled for the next 0 hour(s) and 42 minute(s) [12:17:08] jouncebot: next [12:17:08] In 0 hour(s) and 42 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1300) [12:17:37] (03Merged) 10jenkins-bot: GenerateFancyCaptchas: Don't try and delete captchas if the filename is empty [extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153598 (https://phabricator.wikimedia.org/T388531) (owner: 10Reedy) [12:17:41] there we go [12:17:53] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission krb1001.eqiad.wmnet - https://phabricator.wikimedia.org/T396007#10883530 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None 
[12:17:54] effie: we're about to deploy the captcha changes [12:18:04] 06SRE, 06Infrastructure-Foundations: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10883540 (10MoritzMuehlenhoff) [12:18:11] 06SRE, 06Infrastructure-Foundations: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10883541 (10MoritzMuehlenhoff) All done! [12:18:22] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1153591|GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592|captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593|captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595|captcha.py: Bail out if no words were read [12:18:22] from wordlist (T395809)]], [[gerrit:1153598|GenerateFancyCaptchas: Don't try and delete captchas if the filename is empty (T388531)]] [12:18:23] 06SRE, 06Infrastructure-Foundations: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10883542 (10MoritzMuehlenhoff) 05Open→03Resolved [12:18:24] claime: cool np [12:18:28] T388531: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531 [12:18:28] T395810: captcha.py: Doesn't like use of ~/filename - https://phabricator.wikimedia.org/T395810 [12:18:28] T395804: captcha.py: Gracefully handle output dir not existing - https://phabricator.wikimedia.org/T395804 [12:18:28] T395809: captcha.py: Error if wordlist provided, but empty - https://phabricator.wikimedia.org/T395809 [12:18:48] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [12:20:18] (03CR) 10Ayounsi: [C:03+1] "nice! 
lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/1153596 (https://phabricator.wikimedia.org/T395555) (owner: 10Cathal Mooney) [12:20:30] !log reedy@deploy1003 reedy: Backport for [[gerrit:1153591|GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592|captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593|captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595|captcha.py: Bail out if no words were read from wordlist (T3 [12:20:31] 95809)]], [[gerrit:1153598|GenerateFancyCaptchas: Don't try and delete captchas if the filename is empty (T388531)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:20:58] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153604 [12:21:09] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:21:12] !log reedy@deploy1003 reedy: Continuing with sync [12:21:29] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [12:22:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance [12:22:52] (03PS1) 10Marostegui: db2217: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1153605 (https://phabricator.wikimedia.org/T395989) [12:23:02] (03CR) 10Cathal Mooney: [C:03+2] Templates: replace 'section' macro with include statements [homer/public] - 10https://gerrit.wikimedia.org/r/1153596 (https://phabricator.wikimedia.org/T395555) (owner: 10Cathal Mooney) [12:23:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2217 T395989', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json [12:23:06] T395989: Migrate s6 to MariaDB 10.11 - 
https://phabricator.wikimedia.org/T395989 [12:23:37] (03Merged) 10jenkins-bot: Templates: replace 'section' macro with include statements [homer/public] - 10https://gerrit.wikimedia.org/r/1153596 (https://phabricator.wikimedia.org/T395555) (owner: 10Cathal Mooney) [12:24:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T395241)', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json [12:24:30] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [12:24:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T395241)', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json [12:24:58] (03CR) 10Marostegui: [C:03+2] db2217: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1153605 (https://phabricator.wikimedia.org/T395989) (owner: 10Marostegui) [12:25:04] k8s is 50% done [12:25:15] (03CR) 10Fabfur: [C:03+1] haproxy: remove conditionals on wikimedia_trust [puppet] - 10https://gerrit.wikimedia.org/r/1152894 (owner: 10Giuseppe Lavagetto) [12:25:16] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" [12:25:22] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" [12:25:22] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:25:23] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [12:26:05] Reedy: so everything is 50% done, there's barely 
anything left that isn't k8s [12:26:18] just mwmaint [12:26:39] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors [12:26:42] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors [12:27:11] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors [12:27:14] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors [12:27:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:28:04] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10883615 (10cmooney) >>! In T393102#10876497, @Jhancock.wm wrote: > @Andrew not sure why but i can't get it to pxe at all anymore. Can you take a look for m... [12:28:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json [12:28:13] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153591|GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592|captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593|captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595|captcha.py: Bail out if no words were rea [12:28:13] d from wordlist (T395809)]], [[gerrit:1153598|GenerateFancyCaptchas: Don't try and delete captchas if the filename is empty (T388531)]] (duration: 09m 51s) [12:28:18] T388531: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531 [12:28:18] T395810: captcha.py: Doesn't like use of ~/filename - 
https://phabricator.wikimedia.org/T395810 [12:28:18] T395804: captcha.py: Gracefully handle output dir not existing - https://phabricator.wikimedia.org/T395804 [12:28:19] T395809: captcha.py: Error if wordlist provided, but empty - https://phabricator.wikimedia.org/T395809 [12:29:15] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Homer: stop using the 'section' macro in jinja templates - https://phabricator.wikimedia.org/T395555#10883638 (10cmooney) 05Open→03Resolved a:03cmooney [12:29:16] claime: should be GTG... [12:29:38] Reedy: Awesome thanks [12:29:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json [12:29:58] effie: you can go ahead I think [12:30:03] jouncebot: nowandnext [12:30:04] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [12:30:04] In 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1300) [12:30:12] claime: tx [12:30:23] (03PS1) 10Clément Goubert: mw::maintenance: Migrate generatecaptcha to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1153609 [12:30:41] I'll see if I have time to test between your deployment and the backport window, if not I'll test after the window [12:31:04] claime: Just to point out... 
Without the --delete, the --fill won't do anything [12:31:04] back in a jiff [12:31:14] so you might want to say up 10000 to 11000 or 12000 [12:31:22] Reedy: argh good catch, yeah [12:31:29] I forgot that [12:31:35] :) [12:32:02] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [12:32:06] andrew@cumin1002 reimage (PID 453410) is awaiting input [12:32:15] (03PS2) 10Clément Goubert: mw::maintenance: Migrate generatecaptcha to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1153609 [12:32:17] this process x) [12:32:37] heh [12:32:37] (03PS1) 10Effie Mouzeli: mediawiki: add tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) [12:32:46] anyway, bbiab, thanks again Reedy [12:33:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T395241)', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json [12:33:21] (03CR) 10Reedy: [C:03+1] mw::maintenance: Migrate generatecaptcha to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1153609 (owner: 10Clément Goubert) [12:34:06] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye [12:34:24] (03PS2) 10Effie Mouzeli: mediawiki: add tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) [12:35:27] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" [12:35:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" [12:35:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:35:58] !log andrew@cumin1002 START - 
Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet [12:36:05] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:36:08] !log installing modsecurity-apache security updates [12:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:13] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: enable object storage for gitlab-artifacts in production [puppet] - 10https://gerrit.wikimedia.org/r/1148796 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [12:36:28] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter ds: allow mw-mcrouter ds to run on mw-experimental nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153594 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [12:37:30] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:38:04] (03Merged) 10jenkins-bot: mcrouter ds: allow mw-mcrouter ds to run on mw-experimental nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153594 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [12:38:41] (03CR) 10Btullis: [C:03+1] hdfs: Exclude group 7 and 8 hosts from analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1153560 (https://phabricator.wikimedia.org/T390174) (owner: 10Stevemunene) [12:39:36] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [12:39:44] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm [12:39:45] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [12:39:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10883670 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host ganeti7001.magru.wmnet with OS bookworm completed: - ganeti7... [12:41:42] (03CR) 10FNegri: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5759/console" [puppet] - 10https://gerrit.wikimedia.org/r/1153579 (owner: 10FNegri) [12:42:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.691s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:42:29] this is not me I am afraid [12:42:39] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet [12:42:40] (03CR) 10FNegri: [V:03+1] "I did a PCC on one cloudbd and one sanitarium, both are noops." 
[puppet] - 10https://gerrit.wikimedia.org/r/1153579 (owner: 10FNegri) [12:43:03] (03Abandoned) 10Cathal Mooney: ASW Templates: modify Jinja templates step 1 [homer/public] - 10https://gerrit.wikimedia.org/r/1153172 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [12:43:09] !log andrew@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet [12:43:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json [12:43:15] (03CR) 10Cathal Mooney: ASW Templates: modify Jinja templates step 1 (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1153172 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [12:45:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:47:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.691s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:48:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77052 and previous config saved to /var/cache/conftool/dbconfig/20250604-124812-fceratto.json [12:48:28] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) 
- https://phabricator.wikimedia.org/T378922#10883699 (10Jelto) Total number of artifacts was reduced from 400k to around 100k in T395014. I enabled object storage for the arti... [12:48:42] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10883700 (10Jelto) [12:48:50] (03PS4) 10Majavah: maintain-dbusers: harvest: Do not create PAWS account on ToolsDB [puppet] - 10https://gerrit.wikimedia.org/r/1153563 [12:49:14] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: fix indentation in daemonset.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153613 (owner: 10Effie Mouzeli) [12:49:28] (03PS3) 10Effie Mouzeli: mediawiki: add tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) [12:49:43] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1004.eqiad.wmnet [12:50:06] (03Merged) 10jenkins-bot: mcrouter: fix indentation in daemonset.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153613 (owner: 10Effie Mouzeli) [12:50:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:51:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:51:47] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet [12:52:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:52:44] (03PS1) 10Andrew 
Bogott: Add cloud-instances-octavia-lb-mgmt-net for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1153614 (https://phabricator.wikimedia.org/T393783) [12:53:13] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [12:53:22] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [12:54:49] some memcached errors are expected [12:54:58] (03PS1) 10Cathal Mooney: ASW Templates: modify Jinja templates step 1 (try 2) [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530) [12:56:32] (03PS2) 10Andrew Bogott: Add cloud-instances-octavia-lb-mgmt-net for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1153614 (https://phabricator.wikimedia.org/T393783) [12:56:39] (03PS2) 10Cathal Mooney: ASW Templates: modify Jinja templates step 1 (try 2) [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530) [12:57:45] (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-05-21-192515 to 2025-06-03-205630 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153616 (https://phabricator.wikimedia.org/T394314) [12:57:51] (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-05-21-192453 to 2025-06-03-231524 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153617 (https://phabricator.wikimedia.org/T394314) [12:57:59] (03PS3) 10Cathal Mooney: ASW Templates: modify Jinja templates step 1 (try 2) [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530) [12:58:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to 
/var/cache/conftool/dbconfig/20250604-125817-root.json [12:59:45] (03CR) 10Ssingh: hiera: Replace lvs1017 with lvs1016 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1300). nyaa~ [13:00:05] HouseOfM and James_F: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] * James_F waves. [13:00:13] I’m in a meeting, can’t deploy [13:00:14] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5760/co" [puppet] - 10https://gerrit.wikimedia.org/r/1153614 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [13:00:26] PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:00:30] Can anyone else? [13:01:06] I'll do it. [13:01:13] thanks! 
[13:01:27] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet [13:01:54] !log sbassett@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [13:01:56] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5761/co" [puppet] - 10https://gerrit.wikimedia.org/r/1153614 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [13:02:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146628 (https://phabricator.wikimedia.org/T393604) (owner: 10Mhorsey) [13:02:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153385 (https://phabricator.wikimedia.org/T128546) (owner: 10Jforrester) [13:02:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151781 (owner: 10Jforrester) [13:02:04] !log sbassett@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [13:02:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151751 (https://phabricator.wikimedia.org/T383079) (owner: 10Jforrester) [13:02:08] !log sbassett@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [13:02:10] !log sbassett@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [13:02:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:02:26] PROBLEM - Hadoop NodeManager on an-worker1198 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:02:43] (03CR) 10Majavah: [V:03+1 C:03+1] Add cloud-instances-octavia-lb-mgmt-net for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1153614 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [13:03:18] !log jmm@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B [13:03:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json [13:03:22] (03Merged) 10jenkins-bot: release CampaignEvents to cbk-zam wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146628 (https://phabricator.wikimedia.org/T393604) (owner: 10Mhorsey) [13:03:25] (03Merged) 10jenkins-bot: Bump portals to the 2025-06-02 09:23:11+00:00 build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153385 (https://phabricator.wikimedia.org/T128546) (owner: 10Jforrester) [13:03:26] RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:03:28] (03Merged) 10jenkins-bot: build: Rename the rarely-used 'typos' script to 'checkTypos' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151781 (owner: 10Jforrester) [13:03:39] (03Merged) 10jenkins-bot: Drop Chart roll-out dblists, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151751 
(https://phabricator.wikimedia.org/T383079) (owner: 10Jforrester) [13:03:52] (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1153614 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [13:04:03] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1146628|release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385|Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781|build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751|Drop Chart roll-out dblists, no longer needed (T383079)]] [13:04:06] (03PS1) 10Andrew Bogott: Update keystone policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1153619 (https://phabricator.wikimedia.org/T396013) [13:04:09] T393604: Enable Extension:CampaignEvents on cbk-zam.wikipedia.org - https://phabricator.wikimedia.org/T393604 [13:04:09] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [13:04:10] T383079: Epic: Deploy Charts to other wikis - https://phabricator.wikimedia.org/T383079 [13:04:21] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B [13:05:29] (03CR) 10Majavah: [C:03+1] Update keystone policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1153619 (https://phabricator.wikimedia.org/T396013) (owner: 10Andrew Bogott) [13:05:37] (03CR) 10Andrew Bogott: [C:03+2] Update keystone policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1153619 (https://phabricator.wikimedia.org/T396013) (owner: 10Andrew Bogott) [13:05:51] (03PS20) 10Ssingh: conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (https://phabricator.wikimedia.org/T288106) (owner: 10BCornwall) [13:06:05] (03CR) 10Ssingh: "Added Bug #, no code change since last reviewed PS." 
[puppet] - 10https://gerrit.wikimedia.org/r/1114074 (https://phabricator.wikimedia.org/T288106) (owner: 10BCornwall) [13:06:12] !log jforrester@deploy1003 jforrester, mhorsey: Backport for [[gerrit:1146628|release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385|Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781|build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751|Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia [13:06:13] .org/wiki/Mwdebug). Changes can now be verified there. [13:07:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:07:26] !log jforrester@deploy1003 jforrester, mhorsey: Continuing with sync [13:07:42] (03PS1) 10Giuseppe Lavagetto: analytics::packages::common: include ua-parser regexes.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1153620 (https://phabricator.wikimedia.org/T394794) [13:08:26] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [13:08:35] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:08:36] (03PS1) 10Vgutierrez: Revert "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1153621 [13:09:05] this is me [13:09:08] this is m [13:09:56] (03CR) 10CI reject: [V:04-1] analytics::packages::common: include ua-parser regexes.yaml 
[puppet] - 10https://gerrit.wikimedia.org/r/1153620 (https://phabricator.wikimedia.org/T394794) (owner: 10Giuseppe Lavagetto) [13:10:32] (03CR) 10Ssingh: [C:03+1] Revert "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1153621 (owner: 10Vgutierrez) [13:11:29] (03CR) 10Vgutierrez: [C:03+2] Revert "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1153621 (owner: 10Vgutierrez) [13:11:36] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [13:11:36] 06SRE, 10Ganeti, 06Traffic: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015 (10MoritzMuehlenhoff) 03NEW [13:11:44] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [13:11:49] (03CR) 10Clément Goubert: [C:03+1] mediawiki: add tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [13:12:05] (03CR) 10Marostegui: [C:03+1] wikireplicas: centralize max_connections values [puppet] - 10https://gerrit.wikimedia.org/r/1153579 (owner: 10FNegri) [13:12:42] o/ [13:13:08] 06SRE, 10Ganeti, 06Traffic: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015#10883806 (10ssingh) Sounds good, thanks. Let me know if I can help with anything. 
[13:13:10] jouncebot: now [13:13:10] For the next 0 hour(s) and 46 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1300) [13:13:17] (03CR) 10Andrew Bogott: [C:03+2] Add cloud-instances-octavia-lb-mgmt-net for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1153614 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [13:13:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json [13:13:32] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:13:35] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:13:47] (03CR) 10Alexandros Kosiaris: [C:04-1] mediawiki: add tolerations (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [13:14:12] apologies I wasn't here on the hour, is anyone deploying? [13:14:17] HouseOfM: there is a mcrouter rolling restart going on, but it will be almost done by the time you run scap [13:14:21] or anyone [13:14:22] HouseOfM: It's deployed. 
[13:14:29] (03CR) 10FNegri: [V:03+1 C:03+2] wikireplicas: centralize max_connections values [puppet] - 10https://gerrit.wikimedia.org/r/1153579 (owner: 10FNegri) [13:14:33] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146628|release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385|Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781|build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751|Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) [13:14:35] Oh cool, thanks! [13:14:35] effie: Deployment seems to have gone fine. [13:14:37] T393604: Enable Extension:CampaignEvents on cbk-zam.wikipedia.org - https://phabricator.wikimedia.org/T393604 [13:14:38] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [13:14:38] T383079: Epic: Deploy Charts to other wikis - https://phabricator.wikimedia.org/T383079 [13:15:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:15:42] RESOLVED: JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:15:55] That's great, tysm [13:16:20] * Lucas_WMDE is free now for a bit if needed [13:16:32] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:16:39] (03PS4) 10Effie Mouzeli: mediawiki: add tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) [13:16:50] Lucas_WMDE: Only thing is that my portals deploy didn't magically empty the CDN cache; do you know if there's a special command, or do we just wait? [13:17:30] FIRING: LibericaStaleConfig: Liberica instance lvs1013 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=eqiad&var-instance=lvs1013 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [13:17:42] I don’t know anything about portals and am not sure which cache you’re referring to [13:17:49] but you might be looking for the purgeList maintenance script? [13:17:53] (03CR) 10CI reject: [V:04-1] mediawiki: add tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [13:18:02] Lucas_WMDE: Yeah, maybe. Will look after my meetings. 
[13:18:04] https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Purging [13:18:22] (03CR) 10Alexandros Kosiaris: [C:04-1] mw-experimental: initial commit (vanilla) (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150760 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [13:18:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T395241)', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json [13:18:46] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance [13:18:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T395241)', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json [13:19:35] (03CR) 10Alexandros Kosiaris: [C:03+1] "Sigh, I just realized I 've never read the commit message. Disregard." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1150760 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [13:19:47] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-experimental: initial commit (vanilla) (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150760 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [13:20:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:20:47] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye [13:20:53] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10883825 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-be1095.eqiad.wmnet with OS bullseye [13:21:08] (03PS1) 10Samtar: IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153623 (https://phabricator.wikimedia.org/T377975) [13:21:39] !log sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' [13:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:04] James_F: portals/urls-to-purge.txt looks useful ^^ [13:22:33] (03PS1) 10Clément Goubert: shellbox-constraints: Bump memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153624 [13:22:39] though the only SAL search result for it is almost nine years old o_O https://sal.toolforge.org/log/AVez8mRnX4d8bmU7pyye [13:23:07] (03CR) 10Alexandros Kosiaris: [C:04-1] mw-experimental: create new service #6 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [13:23:43] !log 
starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: T288106 [13:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:46] T288106: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 [13:24:01] Lucas_WMDE: are you deploying? (and/or is there any deployment going on?) [13:24:48] I’m not deploying and not aware of anything else either [13:24:54] ack [13:25:03] effie was doing something but said it should be almost done (and that was some minutes ago) [13:25:11] andrew@cumin1002 reimage (PID 453410) is awaiting input [13:25:23] * TheresNoTime has to undo a whoopsy [13:25:26] RECOVERY - Hadoop NodeManager on an-worker1198 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:25:42] I am almost done [13:26:02] (03PS5) 10Alexandros Kosiaris: mediawiki: add tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [13:26:38] (03PS1) 10Muehlenhoff: Add site.pp entries for netflow7002, ncredir7004, install7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153625 (https://phabricator.wikimedia.org/T394263) [13:26:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T395241)', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json [13:26:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153623 (https://phabricator.wikimedia.org/T377975) (owner: 10Samtar) [13:27:29] (03CR) 10CI reject: [V:04-1] mediawiki: add tolerations [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [13:27:30] RESOLVED: LibericaStaleConfig: Liberica instance lvs1013 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=eqiad&var-instance=lvs1013 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [13:27:45] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet [13:27:46] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet [13:28:02] (03CR) 10Ssingh: [C:03+2] conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (https://phabricator.wikimedia.org/T288106) (owner: 10BCornwall) [13:28:25] effie: should I wait until you're done to do a config deployment? [13:28:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json [13:28:37] jclark@cumin1002 reimage (PID 512204) is awaiting input [13:28:42] TheresNoTime: I am done, thank you [13:28:47] :D [13:29:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153623 (https://phabricator.wikimedia.org/T377975) (owner: 10Samtar) [13:29:16] (03PS1) 10Gergő Tisza: Use GetSecurityLogContext hook for goodpass/badpass logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153626 (https://phabricator.wikimedia.org/T395204) [13:29:49] yeah the alert is still firing but the errors have subsided [13:29:50] (03Merged) 10jenkins-bot: IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1153623 (https://phabricator.wikimedia.org/T377975) (owner: 10Samtar) [13:29:53] !log forcing agent run on cp6015: CR 1114074 [13:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:13] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1153623|IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] [13:30:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:30:16] T377975: Enable template favouriting on Beta, pilot wikis, and test - https://phabricator.wikimedia.org/T377975 [13:30:26] (03CR) 10Muehlenhoff: "a" [puppet] - 10https://gerrit.wikimedia.org/r/1153625 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:30:52] There's the RESOLVE :) [13:30:54] !log forcing agent run on cp7001 (single BE node): CR 1114074 [13:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:08] (03PS3) 10CDobbins: add rest of south amer (except Falkland Islands) to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/1153334 [13:32:22] !log samtar@deploy1003 samtar: Backport for [[gerrit:1153623|IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:32:28] 06SRE, 10SRE-Access-Requests: Update SSH key for apine - https://phabricator.wikimedia.org/T393140#10883928 (10cmassaro) Thank you! 
[13:32:35] * TheresNoTime testing ^ [13:33:19] !log samtar@deploy1003 samtar: Continuing with sync [13:33:20] (03PS6) 10Effie Mouzeli: mediawiki: add tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) [13:34:46] (03PS4) 10CDobbins: add rest of south amer (except Falkland Islands) to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/1153334 [13:36:32] (03CR) 10Ayounsi: [C:03+1] Add site.pp entries for netflow7002, ncredir7004, install7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153625 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:37:13] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye [13:37:20] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10883951 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-be1094.eqiad.wmnet with OS bullseye [13:37:21] !log forcing agent run on cp2037 (non-single BE node): CR 1114074 [13:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:01] (03PS7) 10Alexandros Kosiaris: mediawiki: add tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [13:38:15] (03CR) 10Muehlenhoff: [C:03+2] Add site.pp entries for netflow7002, ncredir7004, install7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153625 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:38:17] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet [13:38:30] !log tappof@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet [13:39:24] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - 
https://phabricator.wikimedia.org/T394263#10883953 (10MoritzMuehlenhoff) [13:39:44] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage [13:39:52] (03CR) 10CI reject: [V:04-1] mediawiki: add tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [13:40:11] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153623|IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) [13:40:13] T377975: Enable template favouriting on Beta, pilot wikis, and test - https://phabricator.wikimedia.org/T377975 [13:40:52] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet [13:40:55] !log tappof@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet [13:41:30] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet [13:41:33] !log tappof@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet [13:41:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json [13:42:36] (03PS1) 10Vgutierrez: Revert "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1153627 [13:43:28] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet [13:43:31] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance: Migrate generatecaptcha to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1153609 (owner: 10Clément Goubert) [13:43:34] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - 
https://phabricator.wikimedia.org/T393104#10884018 (10Jclark-ctr) [13:43:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json [13:43:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage [13:43:57] (03CR) 10Muehlenhoff: profile::memcached::instance: Add support for nftables-compatible config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff) [13:44:55] (03CR) 10Ssingh: [C:03+1] Revert "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1153627 (owner: 10Vgutierrez) [13:45:20] (03CR) 10Vgutierrez: [C:03+2] Revert "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1153627 (owner: 10Vgutierrez) [13:45:57] (03CR) 10Muehlenhoff: [C:03+2] Remove unused option to enable host-based auth [puppet] - 10https://gerrit.wikimedia.org/r/1149371 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:46:13] !log forcing ats-backend-restart on cp1104 [13:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:42] (03CR) 10Majavah: [C:03+1] profile::memcached::instance: Add support for nftables-compatible config [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff) [13:47:25] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet [13:48:32] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs1013.eqiad.wmnet} and A:liberica [13:48:49] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs1013.eqiad.wmnet} and A:liberica [13:49:12] !log sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent 
--enable "merging CR 1114074"': T288106 [13:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:14] T288106: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 [13:50:48] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [13:51:33] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [13:51:37] !log Manual run of generatecaptcha on mw-cron, no delete - T388531 [13:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:45] T388531: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531 [13:52:30] PROBLEM - Disk space on an-worker1109 is CRITICAL: DISK CRITICAL - free space: / 2041 MB (3% inode=95%): /tmp 2041 MB (3% inode=95%): /var/tmp 2041 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1109&var-datasource=eqiad+prometheus/ops [13:54:29] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [13:55:05] (03PS8) 10Alexandros Kosiaris: mediawiki: add tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [13:56:13] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage [13:57:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json [13:57:55] (03CR) 10Alexandros Kosiaris: [C:03+1] mediawiki: add tolerations (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [13:58:11] !log 
andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [13:59:36] (03CR) 10Muehlenhoff: [C:03+2] sshd: Remove dead template argument [puppet] - 10https://gerrit.wikimedia.org/r/1148342 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1400) [14:01:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage [14:02:05] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:02:35] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:03:08] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update evaluators from 2025-05-21-192515 to 2025-06-03-205630 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153616 (https://phabricator.wikimedia.org/T394314) (owner: 10Jforrester) [14:04:29] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs1013.eqiad.wmnet} and A:liberica [14:04:43] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-05-21-192515 to 2025-06-03-205630 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153616 (https://phabricator.wikimedia.org/T394314) (owner: 10Jforrester) [14:04:46] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs1013.eqiad.wmnet} and A:liberica [14:05:23] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:05:53] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:06:27] 06SRE, 10Ganeti, 06Traffic: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015#10884146 (10ssingh) [14:06:30] (03CR) 10Hnowlan: [C:03+1] shellbox-constraints: 
Bump memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153624 (owner: 10Clément Goubert) [14:06:55] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:06:56] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:07:03] !log sukhe@cumin1002 START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet [14:07:47] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:07:49] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:07:53] !log sukhe@cumin1002 START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org [14:08:22] !log decommissioning doh7001 and durum7001: T396015 [14:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:25] T396015: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015 [14:09:48] effie: I'm getting a helm fatal when trying to deploy new versions of our wikifunctions charts to prod — `Error: execution error at (function-orchestrator/templates/configmap.yaml:2:3): Pool wf-codfw has no failover servers list, route /local/wf`. Is this a known/temporary thing or should I file a task? [14:10:16] 10ops-eqiad, 06SRE, 06DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#10884152 (10VRiley-WMF) I have created an account with servertech.com, I will be opening up a ticket and investigate how to proceed from this point. Will update once I gather more information. 
[14:10:17] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:10:52] ^ expected [14:11:13] (03PS1) 10Ladsgroup: trafficserver: Add redirect rules for url shortener of beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1153632 (https://phabricator.wikimedia.org/T396012) [14:11:30] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [14:11:41] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [14:12:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T395241)', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json [14:12:31] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [14:12:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T395241)', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json [14:13:40] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet [14:14:39] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" [14:14:43] effie: Filed as https://phabricator.wikimedia.org/T396033 [14:15:46] (03PS1) 10Clément Goubert: mediawiki: Fix captcha wordlists path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153634 (https://phabricator.wikimedia.org/T388531) [14:16:21] 10SRE-SLO, 10Observability-Metrics, 10SRE Observability (FY2024/2025-Q4): Reduce Pyrra's default window from 12w to 4w - https://phabricator.wikimedia.org/T395916#10884173 (10lmata) [14:16:32] (03CR) 10Clément Goubert: [C:03+2] shellbox-constraints: Bump memory limits 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1153624 (owner: 10Clément Goubert) [14:16:33] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [14:17:06] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [14:17:50] andrew@cumin1002 reimage (PID 453410) is awaiting input [14:18:47] (03CR) 10Hnowlan: [C:03+1] logstash: drop thumbor unstructured logs [puppet] - 10https://gerrit.wikimedia.org/r/1153322 (https://phabricator.wikimedia.org/T368180) (owner: 10Cwhite) [14:19:08] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" [14:19:08] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye [14:19:25] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" [14:19:32] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" [14:19:32] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:19:33] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet [14:19:41] 06SRE, 10Ganeti, 06Traffic: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015#10884179 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1002 for hosts: `durum7001.magru.wmnet` - durum7001.magru.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanag... 
[14:19:52] (03CR) 10Hnowlan: [C:03+1] mediawiki: Fix captcha wordlists path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153634 (https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [14:20:33] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Fix captcha wordlists path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153634 (https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [14:20:43] (03CR) 10Jforrester: "It needs Design input, sadly. I'll ask." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153557 (https://phabricator.wikimedia.org/T326094) (owner: 10Jon Harald Søby) [14:20:43] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm [14:21:03] topranks: ok to remove your changes for codfw? [14:21:10] er, merge removal of changes in codfw I meant [14:21:19] netbox cookbook [14:21:23] !log cgoubert@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:21:27] -ssw2-a8-codfw [14:21:35] -anycast-gw-2056-codfw [14:21:46] sukhe: sorry didn't anticipate those [14:21:49] !log cgoubert@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:21:54] but yes I'm removing all that please proceed [14:21:55] thanks! [14:21:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T395241)', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json [14:21:58] thanks! 
[14:22:39] topranks: the fun is back :P [14:22:44] FileNotFoundError: [Errno 2] No such file or directory: '/tmp/dns-check.kq7pthy9/zones/netbox/47.192.10.in-addr.arpa' [14:22:59] (03Merged) 10jenkins-bot: shellbox-constraints: Bump memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153624 (owner: 10Clément Goubert) [14:23:01] yeah that I did expect [14:23:13] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet [14:23:18] I'm in the process of deleting more prefixes now, was gonna submit a patch when done with them removed... [14:23:19] yeah I see the removals [14:23:21] !log cgoubert@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:23:29] no worries, I will just skip this for now [14:23:44] thanks, won't be too long [14:23:45] but yes, let's remove that so authdns-update is unblocked (happy to review) [14:23:51] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" [14:24:12] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" [14:24:12] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:24:14] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org [14:24:19] 06SRE, 10Ganeti, 06Traffic: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015#10884207 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1002 for hosts: `doh7001.wikimedia.org` - doh7001.wikimedia.org (**PASS**) - Downtimed host on Icinga/Alertmanag... 
[14:24:38] 06SRE, 10Ganeti, 06Traffic: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015#10884209 (10ssingh) [14:24:55] !log cgoubert@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:25:06] !log cgoubert@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:25:09] 06SRE, 10Ganeti, 06Traffic: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015#10884211 (10ssingh) @Muehlenhoff: Both of these are decommissioned. Let me know if any other action is required from my end, thanks! [14:25:23] PROBLEM - Host aux-k8s-etcd2004 is DOWN: PING CRITICAL - Packet loss = 100% [14:26:08] !log cgoubert@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:26:08] ^aux-k8s-etcd2004 is expected [14:26:21] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [14:27:21] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:27:41] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:28:03] (03Merged) 10jenkins-bot: mediawiki: Fix captcha wordlists path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153634 (https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [14:28:37] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:28:52] !log cgoubert@deploy1003 Started scap sync-world: 1153634: mediawiki: Fix captcha wordlists path - T388531 [14:28:56] T388531: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531 [14:29:21] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet [14:29:34] 10ops-codfw, 06SRE, 06SRE-OnFire, 10Cassandra, and 3 others: additional sessionstore expansion — codfw - https://phabricator.wikimedia.org/T395954#10884228 (10Jhancock.wm) i have 12 x 480GB drives readily available on site [14:29:49] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet [14:30:25] RECOVERY - Host aux-k8s-etcd2004 is UP: PING OK - Packet loss = 0%, RTA = 316.06 ms [14:31:04] (03PS1) 10Giuseppe Lavagetto: wikifunctions: disable mcrouter failover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153638 (https://phabricator.wikimedia.org/T396033) [14:31:17] !log cgoubert@deploy1003 Finished scap sync-world: 1153634: mediawiki: Fix captcha wordlists path - T388531 (duration: 02m 24s) [14:31:20] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet [14:31:52] 10ops-eqiad, 06SRE, 06DC-Ops: Rack and cable a single mgmt switch in one of the future machine learning racks - https://phabricator.wikimedia.org/T395941#10884241 (10RobH) >>! In T395941#10882582, @ayounsi wrote: > That makes sens to me, but I'd prefer we purchase a new (unmanaged) switch and not re-use a de... 
[14:32:15] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:32:35] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:32:49] <_joe_> James_F: once CI runs, you should be unblocked [14:33:25] !log cgoubert@deploy1003 Started scap sync-world: 1153634: mediawiki: Fix captcha wordlists path - T388531 [14:33:35] RESOLVED: ProbeDown: Service ganeti2019:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:33:39] *grmbl forgot to git pull* [14:34:28] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10884260 (10Jhancock.wm) Those are the new servers that i'm still getting racked. The one i need to move a port for is cloudcephosd2003-dev, connected to po... [14:34:30] (03CR) 10Giuseppe Lavagetto: [C:03+2] wikifunctions: disable mcrouter failover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153638 (https://phabricator.wikimedia.org/T396033) (owner: 10Giuseppe Lavagetto) [14:35:49] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10884262 (10Jhancock.wm) oh whoops. ty! 
[14:35:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:3 (Core: ssw2-a8-codfw:ethernet-1/33 {#10693_12295-4}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:35:59] !log cgoubert@deploy1003 Finished scap sync-world: 1153634: mediawiki: Fix captcha wordlists path - T388531 (duration: 02m 33s) [14:36:01] T388531: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531 [14:36:02] (03Merged) 10jenkins-bot: wikifunctions: disable mcrouter failover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153638 (https://phabricator.wikimedia.org/T396033) (owner: 10Giuseppe Lavagetto) [14:37:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json [14:37:43] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:38:03] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:38:38] jouncebot: now [14:38:38] For the next 0 hour(s) and 21 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1400) [14:38:42] (03PS1) 10Jcrespo: dbbackups: Upgrade s6, s2 to 10.11 and produce new backups on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1153641 (https://phabricator.wikimedia.org/T395989) [14:38:52] jouncebot: next [14:38:52] In 2 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1700) [14:39:49] James_F: this should be ok now, are you planning to run scap? 
[14:41:07] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet [14:42:38] James_F: ping me when you are back if you are to run scap [14:43:37] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [14:44:09] (03PS1) 10Ladsgroup: beta: Add config for w.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1153643 (https://phabricator.wikimedia.org/T396012) [14:44:21] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:44:26] (03CR) 10Jcrespo: [C:04-1] "Do not merge until s2 codfw migration is more or less finished (specially the primary)." [puppet] - 10https://gerrit.wikimedia.org/r/1153641 (https://phabricator.wikimedia.org/T395989) (owner: 10Jcrespo) [14:45:07] (03PS9) 10Effie Mouzeli: mediawiki: add tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) [14:45:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:3 (Core: ssw2-a8-codfw:ethernet-1/33 {#10693_12295-4}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:46:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:46:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye [14:46:12] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [14:46:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10884320 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-be1095.eqiad.wmnet with OS bullseye completed... [14:46:15] (03PS10) 10Effie Mouzeli: mediawiki: add tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) [14:46:27] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [14:46:58] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:47:15] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet [14:47:22] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet [14:49:16] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:49:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:49:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye [14:49:44] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10884331 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-be1094.eqiad.wmnet with OS bullseye completed... 
[14:49:56] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" [14:50:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" [14:50:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:50:06] (03PS1) 10Cathal Mooney: Remove include statements for ranges used in temp Nokia lab [dns] - 10https://gerrit.wikimedia.org/r/1153645 (https://phabricator.wikimedia.org/T385217) [14:50:15] (03PS7) 10Effie Mouzeli: mw-experimental: create new service #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) [14:51:03] (03CR) 10SBassett: [C:03+1] Use GetSecurityLogContext hook for goodpass/badpass logging (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153626 (https://phabricator.wikimedia.org/T395204) (owner: 10Gergő Tisza) [14:52:09] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10884344 (10Jclark-ctr) [14:52:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json [14:52:48] (03CR) 10Ssingh: [C:03+1] Remove include statements for ranges used in temp Nokia lab [dns] - 10https://gerrit.wikimedia.org/r/1153645 (https://phabricator.wikimedia.org/T385217) (owner: 10Cathal Mooney) [14:52:53] jouncenot: nowandnext [14:52:58] jouncebot: nowandnext [14:52:58] For the next 0 hour(s) and 7 minute(s): Wikifunctions Services UTC Afternoon 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1400) [14:52:58] In 2 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1700) [14:54:11] (03CR) 10Cathal Mooney: [C:03+2] Remove include statements for ranges used in temp Nokia lab [dns] - 10https://gerrit.wikimedia.org/r/1153645 (https://phabricator.wikimedia.org/T385217) (owner: 10Cathal Mooney) [14:54:30] !log cmooney@dns2005 START - running authdns-update [14:54:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10884352 (10Stevemunene) Added the Raid0 config with ` stevemunene@an-worker1163:~$ sudo perccli6... [14:54:57] (03PS8) 10Effie Mouzeli: mw-experimental: create new service #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) [14:55:06] !log cmooney@dns2005 END - running authdns-update [14:56:03] Dreamy_Jazz: I am planning to run scap if that is ok [14:56:23] Sure. I was wanting to deploy a config change but it can wait if needed. [14:56:25] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki: add tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [14:56:28] (03CR) 10FNegri: [V:03+1 C:03+2] "I dropped one of the 4 users with a custom `max_connections`, and let maintain-dbusers recreate it. 
It was recreated successfully with the" [puppet] - 10https://gerrit.wikimedia.org/r/1153579 (owner: 10FNegri) [14:56:30] (03PS1) 10Dreamy Jazz: Set wgCheckUserDisableCheckUserAPI to false on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153646 (https://phabricator.wikimedia.org/T396010) [14:56:33] cool thanks [14:56:43] (03PS1) 10Clément Goubert: mediawiki: Fix captcha configmap structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153647 (https://phabricator.wikimedia.org/T388531) [14:56:52] (03CR) 10CI reject: [V:04-1] mediawiki: Fix captcha configmap structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153647 (https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [14:56:57] (03PS2) 10Clément Goubert: mediawiki: Fix captcha configmap structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153647 (https://phabricator.wikimedia.org/T388531) [14:57:55] (03PS2) 10Dreamy Jazz: Set wgCheckUserDisableCheckUserAPI to false on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153646 (https://phabricator.wikimedia.org/T396010) [14:58:05] !log installing Linux 5.10.237 on Bullseye hosts [14:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:07] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet [14:58:52] (03PS2) 10Ladsgroup: beta: Add config for w.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1153643 (https://phabricator.wikimedia.org/T396012) [14:58:56] (03Merged) 10jenkins-bot: mediawiki: add tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153611 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [14:59:38] (03PS1) 10MVernon: ms-be eqiad: add 2 new backends, drain 2 old ones [puppet] - 10https://gerrit.wikimedia.org/r/1153648 (https://phabricator.wikimedia.org/T393104) [15:00:01] !log stevemunene@cumin1002 END (PASS) - Cookbook 
sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet
[15:00:21] (PS9) Effie Mouzeli: mw-experimental: create new service #6 [deployment-charts] - https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994)
[15:00:49] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[15:00:52] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[15:00:59] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[15:02:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:02:46] !log jiji@deploy1003 Started scap sync-world: T276994: Chart bump, noop
[15:02:48] T276994: Provide an mwdebug functionality on kubernetes (mw-experimental) - https://phabricator.wikimedia.org/T276994
[15:04:02] (CR) FNegri: [V:+1 C:+2] "I dropped & recreated that user again (with a new password) because I inadvertently pasted the previous password hash in the comment above" [puppet] - https://gerrit.wikimedia.org/r/1153579 (owner: FNegri)
[15:04:15] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:04:50] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
[15:04:58] (CR) Jcrespo: [C:+1] ms-be eqiad: add 2 new backends, drain 2 old ones [puppet] - https://gerrit.wikimedia.org/r/1153648 (https://phabricator.wikimedia.org/T393104) (owner: MVernon)
[15:05:03] (PS3) Clément Goubert: mediawiki: Fix captcha configmap structure [deployment-charts] - https://gerrit.wikimedia.org/r/1153647 (https://phabricator.wikimedia.org/T388531)
[15:05:39] !log jiji@deploy1003 Finished scap sync-world: T276994: Chart bump, noop (duration: 02m 52s)
[15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:11] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10884415 (cmooney) >>! In T385217#10879725, @Jhancock.wm wrote: > @cmooney I'm gonna reply to Jorge's email about boxes and...
[15:07:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T395241)', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json
[15:07:34] (CR) Effie Mouzeli: mw-experimental: create new service #6 (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) (owner: Effie Mouzeli)
[15:07:34] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance
[15:07:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T395241)', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json
[15:07:57] (CR) Cwhite: [C:+2] logstash: drop thumbor unstructured logs [puppet] - https://gerrit.wikimedia.org/r/1153322 (https://phabricator.wikimedia.org/T368180) (owner: Cwhite)
[15:08:07] (PS5) CDobbins: add rest of south amer (except Falkland Islands) to geo-maps [dns] - https://gerrit.wikimedia.org/r/1153334
[15:08:11] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:09:06] (PS6) CDobbins: add rest of south amer (except Falkland Islands) to geo-maps [dns] -
https://gerrit.wikimedia.org/r/1153334
[15:10:38] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1141425 (owner: PipelineBot)
[15:10:41] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1144533 (owner: PipelineBot)
[15:10:44] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1147757 (owner: PipelineBot)
[15:10:46] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1149748 (owner: PipelineBot)
[15:10:48] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1149751 (owner: PipelineBot)
[15:10:50] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1150704 (owner: PipelineBot)
[15:10:52] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1152670 (owner: PipelineBot)
[15:11:05] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1132075 (owner: PipelineBot)
[15:11:08] (Abandoned) Jforrester: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1151196 (owner: PipelineBot)
[15:11:10] (Abandoned) Jforrester: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1152716 (owner: PipelineBot)
[15:11:32] effie: Are you done with the scap deploy?
[15:12:19] (CR) MVernon: [C:+2] ms-be eqiad: add 2 new backends, drain 2 old ones [puppet] - https://gerrit.wikimedia.org/r/1153648 (https://phabricator.wikimedia.org/T393104) (owner: MVernon)
[15:12:29] Dreamy_Jazz: yes thank you
[15:12:42] Thanks. Will proceed with my config change.
[15:13:37] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1153646 (https://phabricator.wikimedia.org/T396010) (owner: Dreamy Jazz)
[15:13:50] SRE, SRE-swift-storage, Ceph, Patch-For-Review: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10884467 (MatthewVernon)
[15:13:53] Time to try out Spiderpig!
[15:14:12] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:14:27] (Merged) jenkins-bot: Set wgCheckUserDisableCheckUserAPI to false on loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1153646 (https://phabricator.wikimedia.org/T396010) (owner: Dreamy Jazz)
[15:14:32] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough
[15:14:49] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1153646|Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]]
[15:14:53] T396010: Disable the CheckUser API on all WMF wikis except from loginwiki - https://phabricator.wikimedia.org/T396010
[15:15:07] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum
[15:15:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T395241)', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json
[15:16:00] (CR) Effie Mouzeli: [C:+1] profile::memcached::instance: Add support for nftables-compatible config [puppet] - https://gerrit.wikimedia.org/r/1152274 (owner: Muehlenhoff)
[15:16:34] (PS1) Ssingh: cookbooks/sre: use A:dnsbox everywhere instead of -rec/-auth [cookbooks] -
https://gerrit.wikimedia.org/r/1153649
[15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:16:58] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1153646|Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:17:47] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[15:18:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:18:29] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:18:43] (CR) Krinkle: [C:-1] SUL3: Enable client hints data on the auth shared domain (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) (owner: D3r1ck01)
[15:18:47] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:18:59] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:19:10] FIRING: [8x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:19:27] (CR) Dreamy Jazz: [C:-1] SUL3: Enable client hints data on the auth shared domain (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) (owner: D3r1ck01)
[15:20:25] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet
[15:20:49] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
[15:20:55] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10884483 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS b...
[15:22:29] (CR) Ssingh: add rest of south amer (except Falkland Islands) to geo-maps (3 comments) [dns] - https://gerrit.wikimedia.org/r/1153334 (owner: CDobbins)
[15:22:36] (PS10) Effie Mouzeli: mw-experimental: create new service #6 [deployment-charts] - https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994)
[15:24:10] FIRING: [14x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:24:33] (CR) D3r1ck01: SUL3: Enable client hints data on the auth shared domain (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) (owner: D3r1ck01)
[15:24:52] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153646|Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s)
[15:24:55] T396010: Disable
the CheckUser API on all WMF wikis except from loginwiki - https://phabricator.wikimedia.org/T396010
[15:27:04] (PS5) D3r1ck01: SUL3: Enable client hints data on the auth shared domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185)
[15:27:17] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:28:17] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:28:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:29:10] RESOLVED: [18x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:29:56] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet
[15:30:29] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet
[15:31:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json
[15:32:27] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:32:44] SRE, SRE-Access-Requests: Requesting access to deployment for kgraessle - https://phabricator.wikimedia.org/T395370#10884517 (DMburugu) I approve
[15:33:27] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:35:17] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:36:17] (CR) D3r1ck01: SUL3: Enable client hints data on the auth shared domain (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) (owner: D3r1ck01)
[15:37:17] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:38:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.181s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:39:55] FIRING: [20x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:40:17] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:41:17] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:42:54] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet
[15:43:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.181s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:44:40] RESOLVED: [14x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:44:55] FIRING: [18x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:45:59] jouncebot: nowandnext
[15:45:59] No deployments scheduled for the next 1 hour(s) and 14 minute(s)
[15:45:59] In 1 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1700)
[15:46:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json
[15:49:12] (PS3) BCornwall: hiera: Replace lvs1017 with lvs1016 [puppet] - https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145)
[15:49:22] (CR) BCornwall: hiera: Replace lvs1017 with lvs1016 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145) (owner: BCornwall)
[15:49:40] RESOLVED: [18x] BFDdown: BFD session down between cr1-codfw and 208.80.153.38 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:49:51] (PS4) BCornwall: hiera: Replace lvs1017 with lvs1016 [puppet] - https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145)
[15:52:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs -
https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[15:54:35] (CR) Volans: [C:-1] "One minor error, LGTM otherwise." [cookbooks] - https://gerrit.wikimedia.org/r/1153649 (owner: Ssingh)
[15:55:31] (CR) Ssingh: cookbooks/sre: use A:dnsbox everywhere instead of -rec/-auth (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1153649 (owner: Ssingh)
[15:56:17] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:56:26] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:56:32] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:56:34] (PS2) Ssingh: cookbooks/sre: use A:dnsbox everywhere instead of -rec/-auth [cookbooks] - https://gerrit.wikimedia.org/r/1153649
[15:56:49] (CR) Ssingh: cookbooks/sre: use A:dnsbox everywhere instead of -rec/-auth (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1153649 (owner: Ssingh)
[15:58:17] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:58:25] (CR) BCornwall: [V:+1 C:+2] ncredir: Increase hash sizes [puppet] - https://gerrit.wikimedia.org/r/1153327 (owner: BCornwall)
[16:00:47] SRE, serviceops: Silence RESTGatewayBackendErrorsHigh for envoy_cluster_name: mobileapps_cluster - https://phabricator.wikimedia.org/T394609#10884661 (hnowlan) Open→Resolved a:hnowlan
[16:01:17] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 2
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:01:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T395241)', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json
[16:02:17] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:02:30] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10884669 (Jhancock.wm) okay cool. I'm gonna unrack them tomorrow and get them boxed. i replied to Nokia's email asking for pac...
[16:03:25] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
[16:03:40] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10884677 (ops-monitoring-bot) Host an-worker1163.eqiad.wmnet rebooted by stevemunene@cumin1002 w...
[16:03:45] SRE, SRE-Access-Requests, Data-Engineering, LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10884678 (Ahoelzl) Approved.
[16:04:01] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10884680 (Ahoelzl)
[16:04:14] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[16:04:26] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[16:05:17] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:06:05] (CR) Volans: [C:+1] "LGTM, thanks for cleaning up them." [cookbooks] - https://gerrit.wikimedia.org/r/1153649 (owner: Ssingh)
[16:06:17] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:07:37] (CR) Scott French: "I think you also need to update [0] in this same patch, to switch the volume mount over to the single config map."
[deployment-charts] - https://gerrit.wikimedia.org/r/1153647 (https://phabricator.wikimedia.org/T388531) (owner: Clément Goubert)
[16:07:39] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
[16:07:51] PROBLEM - Hadoop NodeManager on an-worker1163 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:08:03] (CR) Ssingh: [V:+2 C:+2] cookbooks/sre: use A:dnsbox everywhere instead of -rec/-auth [cookbooks] - https://gerrit.wikimedia.org/r/1153649 (owner: Ssingh)
[16:08:23] PROBLEM - Hadoop DataNode on an-worker1163 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[16:09:27] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:10:55] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
[16:11:27] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:11:30] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough
[16:12:03] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:12:38] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum
[16:12:41] !log stevemunene@cumin1002 END (PASS) - Cookbook
sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
[16:12:46] (CR) Jasmine: "adding to the above, in the off chance that tj is assigned to a different language in the future or that there does arise a conflict, this" [dns] - https://gerrit.wikimedia.org/r/1152136 (https://phabricator.wikimedia.org/T393803) (owner: Jasmine)
[16:13:30] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:13:40] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
[16:14:21] (Merged) jenkins-bot: cookbooks/sre: use A:dnsbox everywhere instead of -rec/-auth [cookbooks] - https://gerrit.wikimedia.org/r/1153649 (owner: Ssingh)
[16:14:24] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
[16:14:52] !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[16:16:05] (PS2) AOkoth: add codfw to os-reports in service catalog [puppet] - https://gerrit.wikimedia.org/r/1152308 (https://phabricator.wikimedia.org/T350794)
[16:16:48] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
[16:18:27] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
[16:18:29] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:18:42] (CR) AOkoth: [C:+2] add codfw to os-reports in service catalog [puppet] - https://gerrit.wikimedia.org/r/1152308 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[16:18:59] !log
stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
[16:19:18] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet
[16:20:24] vriley@cumin1002 netbox (PID 695500) is awaiting input
[16:21:52] (PS1) Jelto: Revert "gitlab: enable object storage for gitlab-artifacts in production" [puppet] - https://gerrit.wikimedia.org/r/1153655 (https://phabricator.wikimedia.org/T378922)
[16:22:10] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10884819 (Stevemunene) for an-worker1164 I ran ` sudo cookbook sre.hadoop.init-hadoop-workers -...
[16:22:41] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
[16:22:47] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
[16:22:47] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:25:57] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Disk (sdr) failed on ms-be2066 - https://phabricator.wikimedia.org/T395990#10884865 (Jhancock.wm) @MatthewVernon the drive has been replaced. It wasn't under warranty so no need to get dell involved lol. LMK if it all looks good on your end.
[16:26:17] (CR) Dreamy Jazz: "(Not sure I know enough about how auth.wikimedia.org config works, so will leave review to others)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) (owner: D3r1ck01)
[16:27:32] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
[16:27:46] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
[16:29:03] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10884902 (Stevemunene)
[16:30:45] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
[16:30:55] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
[16:32:06] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:33:17] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:34:55] (CR) Cathal Mooney: [C:+1] "Logic looks good, I'm not the best at reading the tests but LGTM." [alerts] - https://gerrit.wikimedia.org/r/1151658 (https://phabricator.wikimedia.org/T388641) (owner: Majavah)
[16:36:04] (CR) Cathal Mooney: [C:+1] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145) (owner: BCornwall)
[16:36:24] (PS4) Clément Goubert: mediawiki: Fix captcha configmap structure [deployment-charts] - https://gerrit.wikimedia.org/r/1153647 (https://phabricator.wikimedia.org/T388531)
[16:40:22] SRE-SLO, Observability-Metrics, SRE Observability (FY2024/2025-Q4): Reduce Pyrra's default window from 12w to 4w - https://phabricator.wikimedia.org/T395916#10884971 (elukey) @herron I got some feedback from the ML team (for the revert risk pilot) that it was nice to see the raw values for the SLO ta...
[16:41:04] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
[16:41:10] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10884983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls...
[16:45:15] (CR) Clément Goubert: "Good catch, thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1153647 (https://phabricator.wikimedia.org/T388531) (owner: Clément Goubert)
[16:45:23] (CR) Clément Goubert: "Done" [deployment-charts] - https://gerrit.wikimedia.org/r/1153647 (https://phabricator.wikimedia.org/T388531) (owner: Clément Goubert)
[16:52:31] PROBLEM - Disk space on an-worker1109 is CRITICAL: DISK CRITICAL - free space: / 2026 MB (3% inode=95%): /tmp 2026 MB (3% inode=95%): /var/tmp 2026 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1109&var-datasource=eqiad+prometheus/ops
[16:53:19] PROBLEM - Host ms-fe2016 is DOWN: PING CRITICAL - Packet loss = 100%
[16:53:23] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines.
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
[16:55:33] RECOVERY - Host ms-fe2016 is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms
[16:55:36] (CR) Majavah: [C:+2] team-wmcs: Add host-bound alerts for BGP sessions [alerts] - https://gerrit.wikimedia.org/r/1151658 (https://phabricator.wikimedia.org/T388641) (owner: Majavah)
[16:56:50] (Merged) jenkins-bot: team-wmcs: Add host-bound alerts for BGP sessions [alerts] - https://gerrit.wikimedia.org/r/1151658 (https://phabricator.wikimedia.org/T388641) (owner: Majavah)
[16:59:06] hi folks, fundraising tech is aware of the flood of email notification spam from smashpig-failmail@wikimedia.org to fr-tech-failmail@wikimedia.org
[16:59:32] We've just deployed something to throttle that category of notification down to 1 per 30 minutes
[17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1700)
[17:00:26] if there are still a large number of messages clogging up queues, please feel free to drop anything with subject including "frpig1002 (gravy)"
[17:00:49] (CR) Hnowlan: [C:+1] mw-cron: Enable limits [deployment-charts] - https://gerrit.wikimedia.org/r/1153602 (https://phabricator.wikimedia.org/T395436) (owner: Clément Goubert)
[17:04:53] (PS1) Bking: flink-operator: remove misleading comment [deployment-charts] - https://gerrit.wikimedia.org/r/1153660 (https://phabricator.wikimedia.org/T395984)
[17:09:38] (CR) Scott French: [C:+1] mediawiki: Fix captcha configmap structure (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1153647 (https://phabricator.wikimedia.org/T388531) (owner: Clément Goubert)
[17:10:08] (CR) Clément Goubert: [C:+2] mediawiki: Fix captcha configmap structure [deployment-charts] -
10https://gerrit.wikimedia.org/r/1153647 (https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [17:12:17] (03PS1) 10Bvibber: Update Charts so they can render from data-mw-charts as well as data-charts [extensions/Chart] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153662 (https://phabricator.wikimedia.org/T395462) [17:12:27] (03PS1) 10Bvibber: Update Charts so they can render from data-mw-charts as well as data-charts [extensions/Chart] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153663 (https://phabricator.wikimedia.org/T395462) [17:12:33] (03Merged) 10jenkins-bot: mediawiki: Fix captcha configmap structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153647 (https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [17:13:16] !log cgoubert@deploy1003 Started scap sync-world: 1153647: mediawiki: Fix captcha configmap structure - T388531 [17:13:19] T388531: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531 [17:15:56] !log cgoubert@deploy1003 Finished scap sync-world: 1153647: mediawiki: Fix captcha configmap structure - T388531 (duration: 02m 39s) [17:22:59] (03CR) 10Gmodena: [C:03+1] flink-operator: remove misleading comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153660 (https://phabricator.wikimedia.org/T395984) (owner: 10Bking) [17:34:12] (03CR) 10Tchanders: [C:03+1] "The patch looks good. We'll wait until the dependencies are stably deployed everywhere before deploying this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153307 (https://phabricator.wikimedia.org/T395933) (owner: 10Dreamy Jazz) [17:35:41] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:36:42] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:46:01] (03CR) 10D3r1ck01: "Ack!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) (owner: 10D3r1ck01) [17:48:06] (03PS2) 10Jdlrobson: Update Charts so they can render from data-mw-charts as well as data-charts [extensions/Chart] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153662 (https://phabricator.wikimedia.org/T395462) (owner: 10Bvibber) [17:48:12] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10885276 (10wiki_willy) Thanks @Marostegui! [17:48:14] (03PS2) 10Jdlrobson: Update Charts so they can render from data-mw-charts as well as data-charts [extensions/Chart] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153663 (https://phabricator.wikimedia.org/T395462) (owner: 10Bvibber) [17:48:28] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [17:48:51] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [17:49:58] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: sync [17:50:15] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync [17:50:20] if no objection gonna run a couple quick backports in prep for a later service deploy [17:50:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/Chart] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153662 (https://phabricator.wikimedia.org/T395462) (owner: 10Bvibber) [17:50:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/Chart] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153663 (https://phabricator.wikimedia.org/T395462) (owner: 10Bvibber) [17:52:50] (03Merged) 10jenkins-bot: Update Charts so they can render from data-mw-charts as well as data-charts [extensions/Chart] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153662 
(https://phabricator.wikimedia.org/T395462) (owner: 10Bvibber) [17:52:52] (03Merged) 10jenkins-bot: Update Charts so they can render from data-mw-charts as well as data-charts [extensions/Chart] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153663 (https://phabricator.wikimedia.org/T395462) (owner: 10Bvibber) [17:53:16] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10Mail: Add DMarcian trial-account address to the dmarc-ruf@wikimedia.org mailing list - https://phabricator.wikimedia.org/T396062 (10Jgreen) 03NEW [17:53:18] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1153662|Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663|Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] [17:53:21] T395462: Charts not being output correctly in Parsoid - https://phabricator.wikimedia.org/T395462 [17:53:24] (03PS1) 10Btullis: flink-operator: Bump the CPU and RAM in the dse-k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153668 (https://phabricator.wikimedia.org/T395984) [17:54:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063 (10cmooney) 03NEW p:05Triage→03Medium [17:55:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#10885329 (10cmooney) [17:55:23] (03PS2) 10Btullis: flink-operator: Bump the CPU and RAM in the dse-k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153668 (https://phabricator.wikimedia.org/T395984) [17:55:30] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1153662|Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663|Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:56:26] !log bvibber@deploy1003 bvibber: Continuing with sync [17:57:20] (03CR) 10Gmodena: flink-operator: Bump the CPU and RAM in the dse-k8s cluster (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153668 (https://phabricator.wikimedia.org/T395984) (owner: 10Btullis) [17:59:37] (03CR) 10Btullis: flink-operator: Bump the CPU and RAM in the dse-k8s cluster (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153668 (https://phabricator.wikimedia.org/T395984) (owner: 10Btullis) [18:00:01] (03CR) 10Gmodena: [C:03+1] flink-operator: Bump the CPU and RAM in the dse-k8s cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153668 (https://phabricator.wikimedia.org/T395984) (owner: 10Btullis) [18:00:05] dduvall and dancy: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1800). 
[18:03:24] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153662|Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663|Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) [18:03:25] (03CR) 10Btullis: [C:03+2] flink-operator: Bump the CPU and RAM in the dse-k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153668 (https://phabricator.wikimedia.org/T395984) (owner: 10Btullis) [18:03:27] T395462: Charts not being output correctly in Parsoid - https://phabricator.wikimedia.org/T395462 [18:03:38] done [18:07:49] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153670 (https://phabricator.wikimedia.org/T392174) [18:07:50] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153670 (https://phabricator.wikimedia.org/T392174) (owner: 10TrainBranchBot) [18:08:04] rolling rolling rolling [18:08:39] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153670 (https://phabricator.wikimedia.org/T392174) (owner: 10TrainBranchBot) [18:10:02] (03Merged) 10jenkins-bot: flink-operator: Bump the CPU and RAM in the dse-k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153668 (https://phabricator.wikimedia.org/T395984) (owner: 10Btullis) [18:11:15] (03PS1) 10Ladsgroup: sre.mysql.pool: Remove diff check functionality [cookbooks] - 10https://gerrit.wikimedia.org/r/1153671 (https://phabricator.wikimedia.org/T383760) [18:13:38] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10885386 (10RobH) Ok, so the issue that keeps coming up with Dell is even though cp7001 may currently show an intake temp of 80F, that isn't an alarm generating temp. What is the error we're seeing from th... 
[18:13:40] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [18:14:49] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [18:15:35] !log puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: T288106 [18:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:37] T288106: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 [18:16:12] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10885395 (10RobH) cp7001 note: It had a power loss event awhile back but should have cleared the 'warn' status and its odd it hasn't. I clearned the log (it only had the power loss event, no temp warnings)... [18:17:49] (03CR) 10CI reject: [V:04-1] sre.mysql.pool: Remove diff check functionality [cookbooks] - 10https://gerrit.wikimedia.org/r/1153671 (https://phabricator.wikimedia.org/T383760) (owner: 10Ladsgroup) [18:18:02] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs T392174 [18:18:05] T392174: 1.45.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T392174 [18:19:50] (03PS1) 10Lucas Werkmeister: beta cluster: Set $wgOATHAuthAccountPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153673 (https://phabricator.wikimedia.org/T396061) [18:21:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153673 (https://phabricator.wikimedia.org/T396061) (owner: 10Lucas Werkmeister) [18:22:15] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: cloudcontrol2010-dev service implementation - https://phabricator.wikimedia.org/T396064 (10Andrew) 03NEW [18:22:42] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): 
Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10885418 (10Andrew) [18:22:57] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10885419 (10Andrew) 05Open→03Resolved This is now imaged and ready for service. [18:23:05] (03PS2) 10Ladsgroup: sre.mysql.pool: Remove diff check functionality [cookbooks] - 10https://gerrit.wikimedia.org/r/1153671 (https://phabricator.wikimedia.org/T383760) [18:24:47] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [18:24:50] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [18:26:11] (03PS1) 10Lucas Werkmeister: beta cluster: Disable $wgOATHRequiredForGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153674 (https://phabricator.wikimedia.org/T396061) [18:26:59] (03PS1) 10Ssingh: sre/__init__:SREBatchRunnerBase: skip downtimed services by default [cookbooks] - 10https://gerrit.wikimedia.org/r/1153675 [18:27:22] (03CR) 10Lucas Werkmeister: "CCing people from Icb46c2d539. (I submitted the parent change for tonight’s backport+config window, but this one can stew and gather opini" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153674 (https://phabricator.wikimedia.org/T396061) (owner: 10Lucas Werkmeister) [18:27:47] (03CR) 10Ssingh: "Commit message has the reasoning. 
Feel free to reject this; in that case, I will just set this in the specific cookbook, overriding action" [cookbooks] - 10https://gerrit.wikimedia.org/r/1153675 (owner: 10Ssingh) [18:28:51] (03CR) 10Lucas Werkmeister: "Doing this in `CommonSettings.php` isn’t super nice, but it’s far from the first code in here to check `$wmgRealm` directly, and I don’t t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153673 (https://phabricator.wikimedia.org/T396061) (owner: 10Lucas Werkmeister) [18:29:10] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) [18:29:10] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org [18:29:50] (03CR) 10CI reject: [V:04-1] sre.mysql.pool: Remove diff check functionality [cookbooks] - 10https://gerrit.wikimedia.org/r/1153671 (https://phabricator.wikimedia.org/T383760) (owner: 10Ladsgroup) [18:36:05] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065 (10cmooney) 03NEW p:05Triage→03Low [18:37:33] (03CR) 10Ssingh: "No, scratch it. Bad idea. I read it again and looked at the flow." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1153675 (owner: 10Ssingh) [18:37:36] (03Abandoned) 10Ssingh: sre/__init__:SREBatchRunnerBase: skip downtimed services by default [cookbooks] - 10https://gerrit.wikimedia.org/r/1153675 (owner: 10Ssingh) [18:37:45] !log depooling cp7001 for CPU stress testing and temperature effects (T373993) [18:37:47] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7001.* [18:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:48] T373993: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993 [18:37:49] (03PS1) 10Bking: flink-operator: raise RAM limits/requests from 1GB to 4GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153676 (https://phabricator.wikimedia.org/T395984) [18:39:08] (03CR) 10Gmodena: [C:03+1] flink-operator: raise RAM limits/requests from 1GB to 4GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153676 (https://phabricator.wikimedia.org/T395984) (owner: 10Bking) [18:39:18] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#10885456 (10cmooney) [18:41:30] (03PS7) 10CDobbins: add rest of South America (except Falkland Islands) to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/1153334 [18:42:08] (03CR) 10CDobbins: add rest of South America (except Falkland Islands) to geo-maps (033 comments) [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins) [18:42:50] (03PS1) 10Dreamy Jazz: CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore [extensions/AbuseFilter] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153679 (https://phabricator.wikimedia.org/T396056) [18:43:05] (03CR) 10Bking: [C:03+2] flink-operator: remove misleading comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153660 (https://phabricator.wikimedia.org/T395984) (owner: 10Bking) [18:43:13] (03CR) 10CI reject: [V:04-1] 
flink-operator: remove misleading comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153660 (https://phabricator.wikimedia.org/T395984) (owner: 10Bking) [18:43:35] (03CR) 10Bking: flink-operator: remove misleading comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153660 (https://phabricator.wikimedia.org/T395984) (owner: 10Bking) [18:43:45] (03CR) 10Bking: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153660 (https://phabricator.wikimedia.org/T395984) (owner: 10Bking) [18:44:22] (03CR) 10Bking: [C:03+2] flink-operator: raise RAM limits/requests from 1GB to 4GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153676 (https://phabricator.wikimedia.org/T395984) (owner: 10Bking) [18:44:43] (03CR) 10Jelto: [C:03+2] Revert "gitlab: enable object storage for gitlab-artifacts in production" [puppet] - 10https://gerrit.wikimedia.org/r/1153655 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [18:45:05] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org [18:46:17] jouncebot: nowandnext [18:46:18] For the next 1 hour(s) and 13 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T1800) [18:46:18] In 1 hour(s) and 13 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T2000) [18:46:36] Dreamy_Jazz: Sling it! [18:46:37] Can I deploy a fix for train blocker task https://phabricator.wikimedia.org/T392174? [18:46:40] Sure. 
[18:46:48] spiderpig here we come [18:46:58] (03CR) 10Dreamy Jazz: [C:03+2] CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore [extensions/AbuseFilter] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153679 (https://phabricator.wikimedia.org/T396056) (owner: 10Dreamy Jazz) [18:47:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/AbuseFilter] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153679 (https://phabricator.wikimedia.org/T396056) (owner: 10Dreamy Jazz) [18:49:07] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10885472 (10RobH) Opened 210926536 using cp3072 and its related cpu throttle as example for the ticket and appended @BCornwall into the ticket with Dell. Hopefully we'll see some feedback from Dell on how... [18:49:43] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [18:51:09] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7001.* [18:55:05] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org [18:57:10] (03Merged) 10jenkins-bot: CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore [extensions/AbuseFilter] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153679 (https://phabricator.wikimedia.org/T396056) (owner: 10Dreamy Jazz) [18:57:36] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1153679|CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] [18:57:39] T396056: PHP Deprecated: Use of MediaWiki\Revision\RevisionStore::getRevisionByTitle with a LinkTarget was deprecated in MediaWiki 1.45 when using Special:BlockedExternalDomains or editing a page - https://phabricator.wikimedia.org/T396056 [18:59:16] (03CR) 10AOkoth: [C:03+2] site: apply doc role to doc2003 [puppet] - 
10https://gerrit.wikimedia.org/r/1152300 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [18:59:19] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:59:32] (03PS2) 10Jforrester: wikifunctions: Update orchestrator from 2025-05-21-192453 to 2025-06-04-185118 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153617 (https://phabricator.wikimedia.org/T391971) [18:59:42] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1153679|CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:59:52] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [19:00:01] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:00:05] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [19:00:18] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [19:00:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10885491 (10ssingh) Commenting on this with my own understanding and for review of others. After that, letting @BCornwall handle updating the task description. IMO the way we... 
[19:02:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:02:42] (03PS1) 10Bvibber: Update chart-renderer service for Parsoid template fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153680 (https://phabricator.wikimedia.org/T395462) [19:03:00] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [19:03:19] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:03:35] FIRING: HelmReleaseBadStatus: Helm release wikifunctions/main-orchestrator on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:05:32] (03CR) 10Bvibber: [C:03+2] Update chart-renderer service for Parsoid template fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153680 (https://phabricator.wikimedia.org/T395462) (owner: 10Bvibber) [19:07:02] (03Merged) 10jenkins-bot: Update chart-renderer service for Parsoid template fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153680 (https://phabricator.wikimedia.org/T395462) (owner: 10Bvibber) [19:07:42] (03PS3) 10Ladsgroup: sre.mysql.pool: Remove diff check functionality [cookbooks] - 10https://gerrit.wikimedia.org/r/1153671 (https://phabricator.wikimedia.org/T383760) [19:08:13] (03PS2) 10Dzahn: Revert^2 "gerrit: add a second replica, start replicating to gerrit2003" [puppet] - 10https://gerrit.wikimedia.org/r/1153265 (https://phabricator.wikimedia.org/T395887) [19:08:51] !log bvibber@deploy1003 helmfile [staging] START 
helmfile.d/services/chart-renderer: apply [19:09:21] !log bvibber@deploy1003 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [19:09:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:09:32] !log bvibber@deploy1003 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [19:10:02] !log bvibber@deploy1003 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [19:10:04] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153679|CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) [19:10:10] !log bvibber@deploy1003 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [19:10:14] T396056: PHP Deprecated: Use of MediaWiki\Revision\RevisionStore::getRevisionByTitle with a LinkTarget was deprecated in MediaWiki 1.45 when using Special:BlockedExternalDomains or editing a page - https://phabricator.wikimedia.org/T396056 [19:10:23] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10885542 (10KFrancis) Hi all, the NDA has been signed. Thanks! 
[19:10:30] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [19:10:41] !log bvibber@deploy1003 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [19:10:47] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org [19:10:47] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) [19:11:15] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org [19:11:15] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org [19:11:23] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] [19:11:31] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] [19:12:41] !log sukhe@dns1004 START - running authdns-update [19:13:21] !log sukhe@dns1004 END - running authdns-update [19:13:35] RESOLVED: HelmReleaseBadStatus: Helm release wikifunctions/main-orchestrator on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:13:45] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10885557 (10Dzahn) @KFrancis Thank you. Has it been added to the NDA/MOU spreadsheet? 
[19:14:15] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:14:42] 10ops-codfw, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker2324:9290 - https://phabricator.wikimedia.org/T396067 (10phaultfinder) 03NEW [19:15:41] (03CR) 10CI reject: [V:04-1] sre.mysql.pool: Remove diff check functionality [cookbooks] - 10https://gerrit.wikimedia.org/r/1153671 (https://phabricator.wikimedia.org/T383760) (owner: 10Ladsgroup) [19:16:37] (03PS4) 10Ladsgroup: sre.mysql.pool: Remove diff check functionality [cookbooks] - 10https://gerrit.wikimedia.org/r/1153671 (https://phabricator.wikimedia.org/T383760) [19:22:14] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [19:23:03] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [19:27:31] (03CR) 10Ssingh: [C:03+1] "No, this is my bad. I got more context from the all the linked patches so yes this is fine." [puppet] - 10https://gerrit.wikimedia.org/r/1151308 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [19:28:17] (03CR) 10Ssingh: [C:03+1] search: Add dnsdisc entries for omega and psi clusters [puppet] - 10https://gerrit.wikimedia.org/r/1151300 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:28:31] (03CR) 10Ssingh: search: Add search-{psi,omega} geoip discovery entries [dns] - 10https://gerrit.wikimedia.org/r/1151304 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:29:17] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10885633 (10KFrancis) Just added! 
[19:30:08] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10885637 (10Dzahn) Confirmed:) thanks again. Moving the ticket forward. [19:30:49] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10885641 (10Dzahn) [19:31:23] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10885645 (10Dzahn) [19:31:59] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet [19:33:20] (03CR) 10Ssingh: [C:03+1] "I revisited this after your above comment. I personally prefer having distinct IPs but I also see that there are other cases in which they" [dns] - 10https://gerrit.wikimedia.org/r/1151303 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:34:28] (03Abandoned) 10Eevans: WIP: Disable cassandra-metrics-collector when Prometheus agent is enabled [puppet] - 10https://gerrit.wikimedia.org/r/378100 (https://phabricator.wikimedia.org/T171772) (owner: 10Eevans) [19:35:03] (03Abandoned) 10Eevans: cassandra-dev2001: upgrade to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/922609 (https://phabricator.wikimedia.org/T337344) (owner: 10Eevans) [19:36:00] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[19:36:02] (03Abandoned) 10Eevans: sessionstore: remove decommissioned hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/998539 (https://phabricator.wikimedia.org/T356828) (owner: 10Eevans) [19:37:48] (03Abandoned) 10Eevans: cassandra: drop (unused) aqs role [puppet] - 10https://gerrit.wikimedia.org/r/1043894 (https://phabricator.wikimedia.org/T313877) (owner: 10Eevans) [19:42:11] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet [19:48:44] (03PS1) 10Bartosz Dziewoński: Treat File::getShortDesc() as possibly unsafe HTML [core] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153686 (https://phabricator.wikimedia.org/T395834) [19:48:55] (03PS1) 10Bartosz Dziewoński: Treat File::getShortDesc() as possibly unsafe HTML [core] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153687 (https://phabricator.wikimedia.org/T395834) [19:49:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [core] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153686 (https://phabricator.wikimedia.org/T395834) (owner: 10Bartosz Dziewoński) [19:49:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [core] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153687 (https://phabricator.wikimedia.org/T395834) (owner: 10Bartosz Dziewoński) [19:52:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [19:52:22] (03PS1) 10Bartosz Dziewoński: SUL3: Retry local login on failure due to invalid/expired login token [extensions/CentralAuth] 
(wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153689 (https://phabricator.wikimedia.org/T390784) [19:52:30] (03PS2) 10Bartosz Dziewoński: SUL3: Retry local login on failure due to invalid/expired login token [extensions/CentralAuth] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153689 (https://phabricator.wikimedia.org/T390784) [19:52:46] (03PS1) 10Bartosz Dziewoński: SUL3: Retry local login on failure… (follow-ups) [extensions/CentralAuth] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153690 (https://phabricator.wikimedia.org/T390784) [19:53:31] (03PS2) 10Bartosz Dziewoński: SUL3: Retry local login on failure… (follow-ups) [extensions/CentralAuth] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153690 (https://phabricator.wikimedia.org/T390784) [19:53:43] (03PS1) 10Bartosz Dziewoński: SUL3: Retry local login on failure due to invalid/expired login token [extensions/CentralAuth] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153691 (https://phabricator.wikimedia.org/T390784) [19:53:51] (03PS1) 10Bartosz Dziewoński: SUL3: Retry local login on failure… (follow-ups) [extensions/CentralAuth] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153692 (https://phabricator.wikimedia.org/T390784) [19:54:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/CentralAuth] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153689 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński) [19:54:12] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet [19:54:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/CentralAuth] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153690 
(https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński) [19:54:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/CentralAuth] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153691 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński) [19:54:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/CentralAuth] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153692 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński) [19:55:11] (03CR) 10Ssingh: [C:03+1] "Specifically in this case, you are using the search IPs so no new IPs should be added anyway. That is something I overlooked and hence 😊" [dns] - 10https://gerrit.wikimedia.org/r/1151303 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T2000). [20:00:04] lucaswerkmeister and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:09] o/ [20:00:32] hi [20:01:23] i can't deploy, i'd appreciate if someone could ship my patches [20:02:18] Amir1 said he could deploy earlier [20:02:47] hi - i can deploy [20:03:10] yay, thanks [20:03:33] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet [20:03:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153673 (https://phabricator.wikimedia.org/T396061) (owner: 10Lucas Werkmeister) [20:04:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:04:23] MatmaRex: ok to ship your 2 together? [20:04:46] cjming: yeah, it can all go out at once [20:04:49] (03Merged) 10jenkins-bot: beta cluster: Set $wgOATHAuthAccountPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153673 (https://phabricator.wikimedia.org/T396061) (owner: 10Lucas Werkmeister) [20:05:01] and i can't really test anything on mwdebug [20:05:10] ack [20:05:11] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1153673|beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] [20:05:14] T396061: Inconsistent user permissions for users who were recently added to a new group (June 2025 edition) - https://phabricator.wikimedia.org/T396061 [20:06:26] (since one patch is for a rare error case, and the other is security hardening just-in-case) [20:07:18] !log cjming@deploy1003 lucaswerkmeister, cjming: Backport for [[gerrit:1153673|beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:07:24] one sec [20:08:06] oh. right. 
beta cluster [20:08:09] can’t test that on WikimediaDebug :D [20:08:20] ya [20:08:24] !log cjming@deploy1003 lucaswerkmeister, cjming: Continuing with sync [20:09:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:09:29] okay I was in a meeting, do you still need me? [20:09:30] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet [20:09:57] Amir1: nope, cjming is deploying <3 [20:10:04] Amazing <3 [20:10:33] hi Amir! you're always needed - but in this case, we got it covered [20:10:41] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet [20:11:29] (03PS1) 10Andrew Bogott: Upgrade Horizon in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1153698 (https://phabricator.wikimedia.org/T393783) [20:11:31] (03PS1) 10Andrew Bogott: Upgrade Horizon in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1153699 (https://phabricator.wikimedia.org/T393783) [20:14:18] (03CR) 10Andrew Bogott: [C:03+2] Upgrade Horizon in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1153698 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [20:15:00] (03PS1) 10Dzahn: admin: add user corvus and add them to contint-roots [puppet] - 10https://gerrit.wikimedia.org/r/1153700 (https://phabricator.wikimedia.org/T395167) [20:15:25] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153673|beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) [20:15:29] T396061: Inconsistent user permissions for users who were recently added to a new group (June 2025 edition) - https://phabricator.wikimedia.org/T396061 [20:15:40] (03CR) 10CI reject: [V:04-1] admin: add user corvus and add 
them to contint-roots [puppet] - 10https://gerrit.wikimedia.org/r/1153700 (https://phabricator.wikimedia.org/T395167) (owner: 10Dzahn) [20:16:34] (03PS1) 10Aleksandar Mastilovic: Deploy the root config folder to all Airflow deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153704 [20:16:42] (03PS2) 10Bking: flink-operator: remove misleading comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153660 (https://phabricator.wikimedia.org/T395984) [20:16:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153686 (https://phabricator.wikimedia.org/T395834) (owner: 10Bartosz Dziewoński) [20:16:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153687 (https://phabricator.wikimedia.org/T395834) (owner: 10Bartosz Dziewoński) [20:16:52] (03CR) 10CI reject: [V:04-1] flink-operator: remove misleading comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153660 (https://phabricator.wikimedia.org/T395984) (owner: 10Bking) [20:17:16] lucaswerkmeister: should be up on beta :) [20:17:42] yup, just tested it and it works :) [20:17:49] (03CR) 10JHathaway: [C:03+2] postfix: Enable summary messages on TLS handshakes [puppet] - 10https://gerrit.wikimedia.org/r/1105780 (https://phabricator.wikimedia.org/T381927) (owner: 10JHathaway) [20:17:49] yay! [20:17:52] (03PS2) 10Aleksandar Mastilovic: Deploy the root config folder to all Airflow deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153704 (https://phabricator.wikimedia.org/T383931) [20:17:53] thanks for deploying! [20:18:01] np! 
[20:18:03] (03PS3) 10Bking: flink-operator: remove misleading comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153660 (https://phabricator.wikimedia.org/T395984) [20:18:56] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet [20:20:27] (03PS2) 10Dzahn: admin: add user corvus and add them to contint-roots [puppet] - 10https://gerrit.wikimedia.org/r/1153700 (https://phabricator.wikimedia.org/T395167) [20:20:42] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet [20:22:33] (03Merged) 10jenkins-bot: Treat File::getShortDesc() as possibly unsafe HTML [core] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153686 (https://phabricator.wikimedia.org/T395834) (owner: 10Bartosz Dziewoński) [20:22:41] (03Merged) 10jenkins-bot: Treat File::getShortDesc() as possibly unsafe HTML [core] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153687 (https://phabricator.wikimedia.org/T395834) (owner: 10Bartosz Dziewoński) [20:23:06] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1153686|Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687|Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] [20:23:39] (03CR) 10AOkoth: [C:03+1] admin: add user corvus and add them to contint-roots [puppet] - 10https://gerrit.wikimedia.org/r/1153700 (https://phabricator.wikimedia.org/T395167) (owner: 10Dzahn) [20:25:16] !log cjming@deploy1003 cjming, matmarex: Backport for [[gerrit:1153686|Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687|Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:26:13] cjming: seems good [20:26:46] i'm assuming Testserver checks failed is ok? 
[20:26:56] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet [20:26:57] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet [20:27:01] there is nothing specific to test, but the search pages that display this message still display correctly, e.g. https://commons.wikimedia.org/w/index.php?search=test+webm&title=Special%3ASearch&profile=advanced&fulltext=1&ns6=1 [20:27:04] cjming: what do you mean? [20:27:45] getting "check_testservers_baremetal-1_of_1 failed" [20:28:11] i have the option to retry testserver checks or continue with deployment or exit scap [20:28:15] leaning to continue [20:28:57] i've never seen that. looks like it comes from here in scap: https://gitlab.wikimedia.org/repos/releng/scap/-/blob/ce19d3c2a7d7f13667a1ee85b74e449a4ac241cd/scap/main.py#L746 [20:30:12] which runs these tests, i guess: https://gerrit.wikimedia.org/g/operations/puppet/+/f132590ef3b80f74d472a1b616ccecadae72f78d/modules/scap/templates/scap.cfg.erb#125 [20:30:17] hmm - maybe i should retry checks [20:31:17] testing with the mwdebug browser extensions, the bare metal mwdebug servers seem to be up from here [20:31:22] maybe it was some transient issue [20:31:40] ya - erring on retrying -- will continue if prompted again [20:31:49] !log cjming@deploy1003 cjming, matmarex: Continuing with sync [20:35:02] if it gives you any details about the error, maybe we should file a bug. maybe someone broke the bare metal config somehow. 
[20:35:16] there was a phab task about that recently I think [20:35:20] and yeah retrying is fine [20:35:45] T380958 [20:35:45] T380958: httpb sometimes fails upon deployment with a HTTP 503 - https://phabricator.wikimedia.org/T380958 [20:37:08] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet [20:37:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission krb1001.eqiad.wmnet - https://phabricator.wikimedia.org/T396007#10885810 (10VRiley-WMF) 05Open→03Resolved [20:38:43] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153686|Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687|Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) [20:39:44] lucaswerkmeister: thanks - added comment to ticket [20:39:55] MatmaRex: should be live :) [20:40:16] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet [20:40:53] cjming: thanks. the rest of my patches should also all go out together, if you have the time to deploy them [20:41:19] oh! 
ok - np [20:41:28] but i can also reschedule if we'd go too far outside the window [20:42:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153689 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński) [20:42:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153690 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński) [20:42:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153691 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński) [20:42:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153692 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński) [20:42:49] whoops - i forgot to check if they needed a rebase - hopefully not [20:44:12] no, this should be fine [20:44:41] cool [20:44:42] i set up the dependent patches in a stack, so they should merge [20:44:59] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet [20:46:13] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet [20:46:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10885828 (10VRiley-WMF) I have reseated the cable from the switch end for pay-lb1002. Would you be able to check to see if it's there? In Netbox, it d... 
[20:49:51] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade), 13Patch-For-Review: Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10885837 (10Dzahn) a:05KFrancis→03None [20:49:54] (03Merged) 10jenkins-bot: SUL3: Retry local login on failure due to invalid/expired login token [extensions/CentralAuth] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153689 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński) [20:49:58] (03Merged) 10jenkins-bot: SUL3: Retry local login on failure… (follow-ups) [extensions/CentralAuth] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1153690 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński) [20:50:04] (03Merged) 10jenkins-bot: SUL3: Retry local login on failure due to invalid/expired login token [extensions/CentralAuth] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153691 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński) [20:50:15] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet [20:51:03] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [20:51:31] (03Merged) 10jenkins-bot: SUL3: Retry local login on failure… (follow-ups) [extensions/CentralAuth] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153692 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński) [20:51:58] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1153689|SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690|SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691|SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692|SUL3: Retry local login on failure… (follow-ups) (T390784)]] [20:52:01] T390784: Error when logging-in on 
auth.wikimedia.org: "The provided authentication token is either expired or invalid." - https://phabricator.wikimedia.org/T390784 [20:53:23] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [20:54:09] !log cjming@deploy1003 matmarex, cjming: Backport for [[gerrit:1153689|SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690|SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691|SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692|SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to [20:54:09] the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:55:00] (03PS2) 10Gergő Tisza: Use GetSecurityLogContext hook for goodpass/badpass logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153626 (https://phabricator.wikimedia.org/T395204) [20:55:08] (03CR) 10Gergő Tisza: Use GetSecurityLogContext hook for goodpass/badpass logging (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153626 (https://phabricator.wikimedia.org/T395204) (owner: 10Gergő Tisza) [20:55:15] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:55:18] MatmaRex: presumably i can go ahead and sync? 
[20:55:42] cjming: yep [20:55:49] !log cjming@deploy1003 matmarex, cjming: Continuing with sync [20:56:08] i just tested logging in on mwdebug and it still works fine, but there's no way to test whether the specific bug is fixed [20:56:22] but we should see it in the logs [20:59:12] vriley@cumin1002 provision (PID 775196) is awaiting input [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T2100) [21:01:12] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [21:02:41] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153689|SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690|SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691|SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692|SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d [21:02:41] uration: 10m 42s) [21:02:46] T390784: Error when logging-in on auth.wikimedia.org: "The provided authentication token is either expired or invalid." - https://phabricator.wikimedia.org/T390784 [21:03:14] alrighty - !log End of UTC late backport window [21:03:49] thank you cjming [21:03:55] yw! [21:04:03] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet [21:04:06] !log end of UTC late backport window [21:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:49] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10885892 (10ifried) @SLyngshede-WMF, I approve this request. 
[21:05:06] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet [21:05:07] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet [21:06:53] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:07:19] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye [21:07:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10885898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS b... [21:10:13] (03CR) 10Bking: [C:03+2] flink-operator: remove misleading comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153660 (https://phabricator.wikimedia.org/T395984) (owner: 10Bking) [21:11:07] PROBLEM - Host logstash2035 is DOWN: PING CRITICAL - Packet loss = 100% [21:12:35] RECOVERY - Host logstash2035 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [21:12:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2055:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2055 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:14:46] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet [21:17:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2055:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2055 - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:18:04] Hello there, any deployers around for T396075 ? (Patch coming) [21:18:04] T396075: Error: Typed property MediaWiki\Extension\CampaignEvents\Event\EventRegistration::$participationOptions must not be accessed before initialization - https://phabricator.wikimedia.org/T396075 [21:22:56] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet [21:24:36] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) [21:25:00] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet [21:25:01] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet [21:29:05] (03CR) 10SBassett: "I mean, it's still real CU data, at the very least, on the beta cluster, despite it being from a much smaller, perhaps almost synthetic us" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153674 (https://phabricator.wikimedia.org/T396061) (owner: 10Lucas Werkmeister) [21:29:07] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) [21:30:33] (03PS1) 10Bking: mw-content-history-reconcile-enrich/mw-content-history-reconcile-enrich-next: +RAM for jobMgr [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153715 (https://phabricator.wikimedia.org/T395984) [21:35:13] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet [21:35:23] 10ops-eqiad, 06SRE, 06SRE-OnFire, 10Cassandra, and 4 
others: additional sessionstore expansion — eqiad - https://phabricator.wikimedia.org/T395955#10885995 (10VRiley-WMF) a:03VRiley-WMF [21:37:06] ^Patch for the train blocker is up, if there are deployers around: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CampaignEvents/+/1153719 [21:39:57] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet [21:40:12] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet [21:40:21] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet [21:40:43] (03PS1) 10NMW03: Set category collation to "uca-az" for Azerbaijani projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153722 (https://phabricator.wikimedia.org/T395896) [21:41:28] (03CR) 10Gergő Tisza: "Done in Ic707d307da08d65e4036b7295b0707030cd609c2." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153364 (https://phabricator.wikimedia.org/T394402) (owner: 10Gergő Tisza) [21:41:58] (03PS1) 10Ladsgroup: conftool: Allow es6 and es7 being set to read only via dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1153723 (https://phabricator.wikimedia.org/T395696) [21:42:57] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet [21:43:11] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet [21:44:09] (03CR) 10Gergő Tisza: "(For the record, @bd808@wikimedia.org says there was no reason for the old behavior other than caution back when the Logstash infrastructu" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153363 (https://phabricator.wikimedia.org/T395967) (owner: 10Gergő Tisza) [21:48:41] (03CR) 10Máté Szabó: [C:03+1] logging: Allow sampling of Logstash logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153363 
(https://phabricator.wikimedia.org/T395967) (owner: 10Gergő Tisza) [21:50:48] (03PS1) 10Ladsgroup: beta: Update url shortener domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153724 (https://phabricator.wikimedia.org/T396012) [21:51:43] Daimona: around? I can deploy now [21:52:00] Yup, thanks [21:52:40] (03PS1) 10Ladsgroup: Bump cache key version in EventStore [extensions/CampaignEvents] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153725 (https://phabricator.wikimedia.org/T396075) [21:52:54] (03CR) 10Ladsgroup: [C:03+2] Bump cache key version in EventStore [extensions/CampaignEvents] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153725 (https://phabricator.wikimedia.org/T396075) (owner: 10Ladsgroup) [21:53:19] (03CR) 10Ladsgroup: [C:03+2] beta: Update url shortener domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153724 (https://phabricator.wikimedia.org/T396012) (owner: 10Ladsgroup) [21:54:32] (03Merged) 10jenkins-bot: beta: Update url shortener domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153724 (https://phabricator.wikimedia.org/T396012) (owner: 10Ladsgroup) [21:55:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/CampaignEvents] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153725 (https://phabricator.wikimedia.org/T396075) (owner: 10Ladsgroup) [21:58:51] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250604T2200) [22:02:04] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet [22:02:18] (03Merged) 10jenkins-bot: Bump cache key version in EventStore [extensions/CampaignEvents] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1153725 
(https://phabricator.wikimedia.org/T396075) (owner: 10Ladsgroup) [22:02:32] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet [22:02:33] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet [22:02:45] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1153725|Bump cache key version in EventStore (T396075)]] [22:02:48] T396075: Error: Typed property MediaWiki\Extension\CampaignEvents\Event\EventRegistration::$participationOptions must not be accessed before initialization - https://phabricator.wikimedia.org/T396075 [22:04:53] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1153725|Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:06:40] (03PS6) 10JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - 10https://gerrit.wikimedia.org/r/1135115 [22:06:58] (03CR) 10CI reject: [V:04-1] run_ci_locally.sh: use bind mounts for local runs [puppet] - 10https://gerrit.wikimedia.org/r/1135115 (owner: 10JHathaway) [22:09:06] (03PS3) 10JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - 10https://gerrit.wikimedia.org/r/1142675 [22:09:28] Amir1: uncaught error is gone, everything's looking good [22:09:38] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [22:09:48] awesome, moving forward! 
[22:10:29] noice, thank you [22:11:36] !log sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 [22:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:52] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet [22:12:15] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet [22:12:37] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet [22:12:52] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet [22:13:31] (03CR) 10Scott French: "Thanks, Effie. Forgot to mention on the last pass: Maybe delete the unnecessary per-release values files introduced in the base? (for cana" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [22:13:59] (03CR) 10Scott French: [C:03+1] mw-cron: Enable limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153602 (https://phabricator.wikimedia.org/T395436) (owner: 10Clément Goubert) [22:15:27] (03CR) 10BryanDavis: [C:04-1] "Sampling is only performed in MediaWiki\Logger\LegacyLogger today. 
That logging handler is not used in the Monolog stack setup by wmf-conf" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153363 (https://phabricator.wikimedia.org/T395967) (owner: 10Gergő Tisza) [22:16:40] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153725|Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) [22:16:43] T396075: Error: Typed property MediaWiki\Extension\CampaignEvents\Event\EventRegistration::$participationOptions must not be accessed before initialization - https://phabricator.wikimedia.org/T396075 [22:20:34] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be2066 - https://phabricator.wikimedia.org/T395990#10886069 (10MatthewVernon) @Jhancock.wm I fear the wrong disk has gone: ` Jun 4 16:22:52 ms-be2066 kernel: [31462174.362949] sd 0:2:5:0: [sdf] tag#606 FAILED Result: hostbyte=DID_B... [22:20:57] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet [22:21:05] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be2066 - https://phabricator.wikimedia.org/T395990#10886070 (10MatthewVernon) (i.e. the kernel thinks sdf got removed, not the bad sdr) [22:27:32] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye [22:27:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10886073 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls... 
[22:29:11] (03CR) 10JHathaway: run_ci_locally.sh: use bind mounts for local runs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1135115 (owner: 10JHathaway) [22:30:38] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet [22:45:41] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir [22:48:58] (03PS2) 10Andrew Bogott: Upgrade Horizon in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1153699 (https://phabricator.wikimedia.org/T393783) [22:48:59] (03PS1) 10Andrew Bogott: Update codfw1dev Horizon version [puppet] - 10https://gerrit.wikimedia.org/r/1153733 [22:49:45] (03CR) 10Andrew Bogott: [C:03+2] Update codfw1dev Horizon version [puppet] - 10https://gerrit.wikimedia.org/r/1153733 (owner: 10Andrew Bogott) [22:54:11] (03CR) 10Andrew Bogott: [C:03+2] Upgrade Horizon in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1153699 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [23:01:16] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10886152 (10Ladsgroup) >>! In T379942#10879247, @Ladsgroup wrote: > The reason I didn't ping you is that when I got to ms-fe, all screens were terminated which might mean it was cut (... [23:21:54] (03CR) 10Xcollazo: [C:03+1] "Worth a try." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1153715 (https://phabricator.wikimedia.org/T395984) (owner: 10Bking) [23:38:44] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1153740 [23:38:44] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1153740 (owner: 10TrainBranchBot) [23:49:06] (03CR) 10Bartosz Dziewoński: logging: Sample some high-volume log streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153364 (https://phabricator.wikimedia.org/T394402) (owner: 10Gergő Tisza) [23:49:51] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1153740 (owner: 10TrainBranchBot) [23:52:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [23:55:45] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir