[00:09:44] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1137570 [00:09:44] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1137570 (owner: TrainBranchBot) [00:10:51] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:22:45] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2097-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [00:23:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [00:24:28] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2097:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [00:25:17] PROBLEM - Disk space on analytics1070 is CRITICAL: DISK CRITICAL - free space: / 2121 MB (3% inode=95%): /tmp 2121 MB (3% inode=95%): /var/tmp 2121 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1070&var-datasource=eqiad+prometheus/ops [00:30:46] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 
https://gerrit.wikimedia.org/r/1137570 (owner: TrainBranchBot) [00:51:37] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/9f101cedcb0befe856cdbcefb8a70401e9822d3e23a7ba6b169df3aa01fa0155/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:03:40] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:08:35] PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2062 is CRITICAL: CRITICAL - nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [01:10:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:11:37] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:20:25] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:23:21] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:36:29] FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:52:27] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:53:23] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:12:51] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:17:01] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2065 is CRITICAL: CRITICAL - bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), ukwiktionary_general_1727979466[0](2025-04-17T22:2 [02:17:01] Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [02:20:27] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2104 is CRITICAL: CRITICAL - ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), 
cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), kuwiki_general_1728075589[0](2025-04-17T22:2 [02:20:27] Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [02:35:25] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:44:41] PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2107 is CRITICAL: CRITICAL - nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [03:03:39] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:03:47] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:05:07] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:05:31] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53801 bytes in 2.442 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:05:37] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:05:57] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:10:21] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2074 is CRITICAL: CRITICAL - mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:2 [03:10:21] Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [03:23:43] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2114 is CRITICAL: CRITICAL - cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), ukwiktionary_general_1727979466[0](2025-04-17T22:2 [03:23:43] Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [03:42:49] PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2115 is CRITICAL: CRITICAL - brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [03:53:40] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:11:35] PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2089 is CRITICAL: CRITICAL - lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), 
brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [04:14:41] PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2068 is CRITICAL: CRITICAL - brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [04:21:43] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2059 is CRITICAL: CRITICAL - kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), mniwiktionary_content_1728016501[0](2025-04-17T22:2 [04:21:43] Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [04:22:45] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2097-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [04:23:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [04:24:28] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2097:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [04:40:35] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2088 is CRITICAL: CRITICAL - bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), cowiki_general_1728063266[0](2025-04-17T22:2 [04:40:35] Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [04:50:21] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:51:21] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:52:51] PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2075 is CRITICAL: CRITICAL - brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [04:57:19] PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2066 is CRITICAL: CRITICAL - brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [05:03:40] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad 
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:10:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:11:39] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:11:47] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:13:07] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:15:01] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:18:29] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:18:37] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.203 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:34:05] PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2085 is CRITICAL: CRITICAL - brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [05:36:29] FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:54:23] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:54:27] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:55:27] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:56:21] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:57:07] PROBLEM - ElasticSearch unassigned shard check - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), bat_smgwiki_general_1728063421[0](2025-0 [05:57:07] 21:16.832Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:00:13] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2070 is CRITICAL: CRITICAL - cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), mniwiktionary_content_1728016501[0](2025-04-17T22:2 [06:00:13] Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:00:29] PROBLEM - ElasticSearch unassigned shard check - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:10:31] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2067 is CRITICAL: CRITICAL - bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), 
mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), cywikibooks_content_1728117258[0](2025-04-17T22:2 [06:10:31] Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:05] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T0700). [07:00:05] robertsky and Aca: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:01:04] *waves* [07:17:01] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137510 (https://phabricator.wikimedia.org/T392334) (owner: Acamicamacaraca) [07:17:36] rescheduling for the late backport window [07:18:01] see ya then [07:30:21] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:31:21] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:44:19] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2111 is CRITICAL: CRITICAL - mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), ukwiktionary_general_1727979466[0](2025-04-17T22:2 [07:44:19] Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [07:53:40] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:56:41] PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2109 is CRITICAL: CRITICAL - lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z), brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [08:00:12] Aca: yeah. apologies. was up to my neck in another matter. 
[08:04:19] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2090 is CRITICAL: CRITICAL - bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), mniwiktionary_content_1728016501[0](2025-04-17T22:2 [08:04:19] Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [08:10:13] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2063 is CRITICAL: CRITICAL - cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:2 [08:10:13] Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [08:22:45] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2097-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [08:23:45] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:24:28] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2097:9100 - TODO - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [08:31:17] (CR) Federico Ceratto: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1137285 (owner: Volans) [08:31:45] PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2082 is CRITICAL: CRITICAL - brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [08:51:37] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2105 is CRITICAL: CRITICAL - cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), kuwiki_general_1728075589[0](2025-04-17T22:2 [08:51:37] Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [08:59:33] PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2058 is CRITICAL: CRITICAL - brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [09:03:40] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:10:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:11:11] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2077 is CRITICAL: CRITICAL - cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), mniwiktionary_content_1728016501[0](2025-04-17T22:2 [09:11:11] Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [09:15:49] PROBLEM - Disk space on analytics1072 is CRITICAL: DISK CRITICAL - free space: / 2121 MB (3% inode=95%): /tmp 2121 MB (3% inode=95%): /var/tmp 2121 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops [09:22:33] (PS3) Majavah: prometheus: cloudvirt-libvirt-stats: Ignore file paths as well [puppet] - https://gerrit.wikimedia.org/r/1137218 (https://phabricator.wikimedia.org/T289563) [09:28:12] (PS1) Majavah: P:toolforge: redis_sentinel: Fix non-breaking spaces [puppet] - https://gerrit.wikimedia.org/r/1137726 [09:28:13] (PS1) Majavah: P:toolforge: redis_sentinel: Don't try to set client name [puppet] - https://gerrit.wikimedia.org/r/1137727 (https://phabricator.wikimedia.org/T366471) [09:29:27] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:29:57] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5321/co" [puppet] - https://gerrit.wikimedia.org/r/1137727 (https://phabricator.wikimedia.org/T366471) (owner: Majavah) [09:30:23] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:33:19] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2056 is CRITICAL: CRITICAL - cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), mniwiktionary_content_1728016501[0](2025-04-17T22:2 [09:33:19] Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [09:36:29] FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:56:11] (PS2) Vgutierrez: wmflib: Add list_secrets() function [puppet] - https://gerrit.wikimedia.org/r/1137055 (https://phabricator.wikimedia.org/T391411) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T1000) [10:11:25] PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2103 is CRITICAL: CRITICAL - nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z), brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z) 
https://wikitech.wikimedia.org/wiki/Search%23Administration [10:12:21] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:14:19] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:20:19] PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2079 is CRITICAL: CRITICAL - nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [10:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:48:24] (PS1) Majavah: hieradata: Drop old cloudinfra cumin hosts [puppet] - https://gerrit.wikimedia.org/r/1137728 (https://phabricator.wikimedia.org/T367725) [10:50:09] (PS1) Majavah: Remove root keys for former staff [labs/private] - https://gerrit.wikimedia.org/r/1137729 [10:50:33] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." 
[labs/private] - https://gerrit.wikimedia.org/r/1137729 (owner: Majavah) [10:51:53] (CR) Majavah: [V:+2 C:+2] Remove root keys for former staff [labs/private] - https://gerrit.wikimedia.org/r/1137729 (owner: Majavah) [10:57:41] (PS1) Majavah: hieradata: cloudinfra: Drop obsolete keys [puppet] - https://gerrit.wikimedia.org/r/1137730 [11:03:59] SRE, Infrastructure-Foundations: LVS: Error with Netbox PuppetDB import script after device moved to Liberica and upgraded - https://phabricator.wikimedia.org/T388770#10756789 (taavi) [11:13:59] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1137730 (owner: Majavah) [11:14:19] (CR) Majavah: [C:+2] hieradata: cloudinfra: Drop obsolete keys [puppet] - https://gerrit.wikimedia.org/r/1137730 (owner: Majavah) [11:24:16] (PS1) Majavah: Add WMCS v6 range to relevant exclusions [mediawiki-config] - https://gerrit.wikimedia.org/r/1137731 (https://phabricator.wikimedia.org/T386689) [11:46:50] (PS1) Majavah: Add WMCS ranges to wgAutoblockExemptions [mediawiki-config] - https://gerrit.wikimedia.org/r/1137732 (https://phabricator.wikimedia.org/T386689) [11:53:40] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:22:45] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2097-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:23:45] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power 
autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [12:24:28] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2097:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:48:04] (03PS1) 10Majavah: hieradata: puppet-compiler: Drop obsolete key [puppet] - 10https://gerrit.wikimedia.org/r/1137747 [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T1300). [13:00:05] danisztls, robertsky, and Aca: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:24] 👋 waves [13:00:27] I am here [13:01:27] o/ [13:03:40] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:47] i guess i can deploy [13:03:56] yay [13:04:00] awesome [13:04:01] :) [13:05:34] danisztls: is there intentionally a space at the end of the question name? 
[13:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:07:27] deploying the other two in the meantime [13:07:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137349 (https://phabricator.wikimedia.org/T392239) (owner: 10Robertsky) [13:07:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137510 (https://phabricator.wikimedia.org/T392334) (owner: 10Acamicamacaraca) [13:07:59] standing by [13:08:17] (03Merged) 10jenkins-bot: wikimaniawiki: update logo to 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137349 (https://phabricator.wikimedia.org/T392239) (owner: 10Robertsky) [13:08:20] (03Merged) 10jenkins-bot: Enable mobile sitenotice for shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137510 (https://phabricator.wikimedia.org/T392334) (owner: 10Acamicamacaraca) [13:08:33] (03PS3) 10DDesouza: Design Research Participant Survey: Pre-deploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137567 (https://phabricator.wikimedia.org/T392325) [13:08:35] taavi: not intentional, fixed [13:08:52] !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1137349|wikimaniawiki: update logo to 2025 (T392239)]], [[gerrit:1137510|Enable mobile sitenotice for shwiki (T392334)]] [13:08:58] T392239: wikimaniawiki: update to 2025 wordmark - https://phabricator.wikimedia.org/T392239 [13:08:58] T392334: Enable Sitenotice in mobile view on Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T392334 [13:09:20] danisztls: thanks! 
will deploy yours once the current batch is out [13:10:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:10:26] taavi: ok, thanks! [13:15:56] the image build is taking a while :/ [13:16:19] :o hope it turns out well though. [13:18:48] (03CR) 10Majavah: [C:03+1] invisible-unicorn: Delete dns entries before removing proxy records [puppet] - 10https://gerrit.wikimedia.org/r/1137483 (https://phabricator.wikimedia.org/T391718) (owner: 10Andrew Bogott) [13:26:37] still doing something.. [13:29:46] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2057 to cirrussearch2057 [13:30:08] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:31:48] image build is finally complete, now it's continuing the deployment [13:32:02] ok [13:36:06] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2057 to cirrussearch2057 - bking@cumin2002" [13:36:29] FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:36:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10756944 (10Jclark-ctr) No alerts for 4 days and temps and fan speeds have dropped closing this ticket for Temp The system inlet tempera... 
[13:36:43] !log taavi@deploy1003 robertsky, taavi, aleksandar: Backport for [[gerrit:1137349|wikimaniawiki: update logo to 2025 (T392239)]], [[gerrit:1137510|Enable mobile sitenotice for shwiki (T392334)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:36:44] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2057 to cirrussearch2057 - bking@cumin2002" [13:36:44] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:36:45] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2057 [13:36:47] T392239: wikimaniawiki: update to 2025 wordmark - https://phabricator.wikimedia.org/T392239 [13:36:48] T392334: Enable Sitenotice in mobile view on Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T392334 [13:36:53] finally [13:36:57] Aca: robertsky: please test [13:36:57] checkin' [13:37:17] checking [13:37:22] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2057 [13:37:40] ok logo is updated [13:37:57] all's good. 
[13:38:02] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2057 to cirrussearch2057 [13:38:15] work as intended, lgtm [13:38:19] works* [13:38:45] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2057.codfw.wmnet with OS bullseye [13:38:56] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2057 [13:39:02] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:40:22] !log taavi@deploy1003 robertsky, taavi, aleksandar: Continuing with sync [13:43:22] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2057 - bking@cumin2002" [13:43:28] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2057 - bking@cumin2002" [13:43:28] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:43:28] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2057.codfw.wmnet 204.16.192.10.in-addr.arpa 4.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:43:32] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2057.codfw.wmnet 204.16.192.10.in-addr.arpa 4.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:43:33] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2057 [13:44:48] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2057 [13:44:49] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2057 [13:49:56] !log taavi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137349|wikimaniawiki: update logo to 2025 (T392239)]], 
[[gerrit:1137510|Enable mobile sitenotice for shwiki (T392334)]] (duration: 41m 04s) [13:50:01] T392239: wikimaniawiki: update to 2025 wordmark - https://phabricator.wikimedia.org/T392239 [13:50:01] T392334: Enable Sitenotice in mobile view on Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T392334 [13:50:14] there we go [13:50:18] jouncebot: nowandnext [13:50:18] For the next 0 hour(s) and 9 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T1300) [13:50:18] In 1 hour(s) and 39 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T1530) [13:50:38] danisztls: we're going to overrun the window, but I'm fine with that if you're still around [13:51:46] Thanks for the deploy! Will have to make it accessible now :) [13:51:54] taavi: I think the logo needs a purge in prod per https://wikitech.wikimedia.org/wiki/Wikimedia_site_requests#Change_the_logo_of_a_Wikimedia_wiki [13:52:15] robertsky: right. one second [13:53:05] https://en.wikipedia.org/static/images/mobile/copyright/wikimaniawiki-wordmark.svg [13:53:07] the image ^ [13:53:28] taavi: thanks! [13:53:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137567 (https://phabricator.wikimedia.org/T392325) (owner: 10DDesouza) [13:53:59] !log taavi@deploy1003 ~ $ echo "https://en.wikipedia.org/static/images/mobile/copyright/wikimaniawiki-wordmark.svg" | mwscript-k8s --attach purgeList.php -- --wiki enwiki [13:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:04] robertsky: done! [13:54:58] thanks. [13:57:49] hmm... the logo looks small on some pages on wikimaniawiki. 
[13:59:26] and now even CI is taking a while :/ [14:00:56] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2057.codfw.wmnet with reason: host reimage [14:02:38] PROBLEM - Disk space on an-worker1139 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 158035 MB (4% inode=99%): /var/lib/hadoop/data/f 157621 MB (4% inode=99%): /var/lib/hadoop/data/j 157146 MB (4% inode=99%): /var/lib/hadoop/data/m 155766 MB (4% inode=99%): /var/lib/hadoop/data/h 156682 MB (4% inode=99%): /var/lib/hadoop/data/k 158039 MB (4% inode=99%): /var/lib/hadoop/data/e 160049 MB (4% inode=99%): /var/lib/hadoop/data [14:02:38] 2 MB (5% inode=99%): /var/lib/hadoop/data/b 155120 MB (4% inode=99%): /var/lib/hadoop/data/d 149539 MB (3% inode=99%): /var/lib/hadoop/data/i 154128 MB (4% inode=99%): /var/lib/hadoop/data/l 151826 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1139&var-datasource=eqiad+prometheus/ops [14:03:28] 10ops-codfw, 06SRE, 10Cassandra, 06DC-Ops: restbase2035 is down - https://phabricator.wikimedia.org/T392243#10756968 (10Jhancock.wm) 05Open→03Resolved replacement DIMM received. defective returned. [14:04:33] (03Merged) 10jenkins-bot: Design Research Participant Survey: Pre-deploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137567 (https://phabricator.wikimedia.org/T392325) (owner: 10DDesouza) [14:04:33] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2057.codfw.wmnet with reason: host reimage [14:04:46] !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1137567|Design Research Participant Survey: Pre-deploy (T392325)]] [14:04:49] T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325 [14:08:31] taavi: any idea why the logo is downscaled for logged out visitors? 
the logo on 2025:Wikimania is at the right size if it is clicked through the sidebar's Main page link. https://postimg.cc/gallery/06qw6Kt/318ad600 [14:09:10] !log taavi@deploy1003 taavi, dani: Backport for [[gerrit:1137567|Design Research Participant Survey: Pre-deploy (T392325)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:09:12] robertsky: i've no idea, sorry [14:09:15] danisztls: please test [14:09:21] (is there anything to test?) [14:11:41] taavi: looks good [14:12:02] !log taavi@deploy1003 taavi, dani: Continuing with sync [14:12:05] thanks, syncing [14:14:26] taavi: thanks! [14:16:43] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Service implementation for cloudcephosd2004-dev - https://phabricator.wikimedia.org/T392366 (10Andrew) 03NEW [14:17:01] RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2065 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [14:18:17] (03PS1) 10Majavah: P:wmcs: cloudgw: Refuse outbound mail via NAT [puppet] - 10https://gerrit.wikimedia.org/r/1137757 (https://phabricator.wikimedia.org/T366936) [14:18:19] (03PS1) 10Majavah: P:exim::smarthost: Convert unsupported domain warn to reject [puppet] - 10https://gerrit.wikimedia.org/r/1137758 (https://phabricator.wikimedia.org/T366935) [14:18:35] (03PS1) 10Andrew Bogott: cloudcephosd2004-dev: put into service [puppet] - 10https://gerrit.wikimedia.org/r/1137759 (https://phabricator.wikimedia.org/T392366) [14:19:17] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd2004-dev: put into service [puppet] - 10https://gerrit.wikimedia.org/r/1137759 (https://phabricator.wikimedia.org/T392366) (owner: 10Andrew Bogott) [14:19:39] !log taavi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137567|Design Research Participant Survey: Pre-deploy (T392325)]] (duration: 14m 53s) [14:19:43] T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325 
[14:20:27] RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2104 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [14:22:14] (03PS1) 10Andrew Bogott: cloudcephosd2004-dev: network hiera setup [puppet] - 10https://gerrit.wikimedia.org/r/1137760 (https://phabricator.wikimedia.org/T392366) [14:22:43] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd2004-dev: network hiera setup [puppet] - 10https://gerrit.wikimedia.org/r/1137760 (https://phabricator.wikimedia.org/T392366) (owner: 10Andrew Bogott) [14:23:49] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch2097 is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 59, number_of_data_nodes: 59, discovered_master: True, active_primary_shards: 1352, active_shards: 4179, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 3, delayed_unassigned_shards: 0, number_of_pending [14:23:49] 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 34897, active_shards_percent_as_number: 99.92826398852223 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:23:49] RECOVERY - OpenSearch health check for shards on 9400 on cirrussearch2097 is OK: OK - elasticsearch status production-search-omega-codfw: cluster_name: production-search-omega-codfw, status: yellow, timed_out: False, number_of_nodes: 30, number_of_data_nodes: 30, discovered_master: True, active_primary_shards: 1704, active_shards: 5088, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 23, delayed_unassigned_shards: 0, numb [14:23:49] nding_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.54999021717863 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:24:26] (03PS1) 10Andrew Bogott: cloudcephosd2004-dev: correct hiera typo in previous patch [puppet] - 
10https://gerrit.wikimedia.org/r/1137761 [14:24:54] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd2004-dev: correct hiera typo in previous patch [puppet] - 10https://gerrit.wikimedia.org/r/1137761 (owner: 10Andrew Bogott) [14:25:01] taavi: got a theory. somehow the system is still holding on to the old dimensions of the previous wordmark. [14:25:15] just not sure where to go from here. [14:28:22] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye [14:28:33] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 13Patch-For-Review: Service implementation for cloudcephosd2004-dev - https://phabricator.wikimedia.org/T392366#10757040 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd2004-dev.codfw.wmn... [14:29:28] RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2097:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:29:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2057.codfw.wmnet with OS bullseye [14:30:09] (03PS1) 10Andrew Bogott: cloudcephosd2004-dev: force to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1137762 [14:31:34] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd2004-dev: force to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1137762 (owner: 10Andrew Bogott) [14:32:50] nvm. it looks ok now. [14:33:29] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:34:25] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:42:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2097-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:44:41] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2107 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [14:45:55] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:46:19] taavi: around still? are the logo cached differently for mobile site? [14:46:27] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2004-dev.codfw.wmnet with reason: host reimage [14:47:12] ah. nvm. [14:47:24] I think my ISP is caching instead. 
[14:49:44] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2004-dev.codfw.wmnet with reason: host reimage [14:52:44] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [14:56:13] !log bking@cumin2002 conftool action : set/pooled=no; selector: name=elastic2078\.codfw\.wmnet [14:57:32] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye [14:57:39] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Service implementation for cloudcephosd2004-dev - https://phabricator.wikimedia.org/T392366#10757083 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye execut... 
[15:01:39] (03PS1) 10Andrew Bogott: Revert "cloudcephosd2004-dev: put into service" [puppet] - 10https://gerrit.wikimedia.org/r/1137763 (https://phabricator.wikimedia.org/T392366) [15:01:41] (03PS1) 10Andrew Bogott: Revert "Revert "cloudcephosd2004-dev: put into service"" [puppet] - 10https://gerrit.wikimedia.org/r/1137764 (https://phabricator.wikimedia.org/T392366) [15:01:55] RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:02:09] (03CR) 10Andrew Bogott: [C:03+2] Revert "cloudcephosd2004-dev: put into service" [puppet] - 10https://gerrit.wikimedia.org/r/1137763 (https://phabricator.wikimedia.org/T392366) (owner: 10Andrew Bogott) [15:02:41] FIRING: [3x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:03:09] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye [15:03:25] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 13Patch-For-Review: Service implementation for cloudcephosd2004-dev - https://phabricator.wikimedia.org/T392366#10757108 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd2004-dev.codfw.wmn... 
[15:04:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [15:07:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:09:08] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=elastic2078\.codfw\.wmnet [15:09:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [15:10:21] RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2074 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [15:12:50] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:21:12] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2004-dev.codfw.wmnet with reason: host reimage [15:23:56] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2004-dev.codfw.wmnet with reason: host reimage 
[15:30:05] jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T1530). nyaa~ [15:33:42] (03PS1) 10Bking: cirrussearch: update conftool with correct pool data [puppet] - 10https://gerrit.wikimedia.org/r/1137782 (https://phabricator.wikimedia.org/T388610) [15:34:45] (03CR) 10Bking: "here's the list of cirrussearch hosts, note that we don't want to add 2079 as it has failed reimage" [puppet] - 10https://gerrit.wikimedia.org/r/1137782 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [15:35:25] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:35:33] (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: enable IPv6 tests on eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1137785 (https://phabricator.wikimedia.org/T391325) [15:36:39] (03CR) 10Ebernhardson: [C:03+1] "verified changed hosts are only available in dns (and pingable) with new names" [puppet] - 10https://gerrit.wikimedia.org/r/1137782 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:50] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2064 to cirrussearch2064 [15:38:02] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:38:31] (03CR) 10Bking: [C:03+2] cirrussearch: update conftool with correct pool data [puppet] - 10https://gerrit.wikimedia.org/r/1137782 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [15:40:25] FIRING: [2x] 
SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:40:56] (03PS2) 10Arturo Borrero Gonzalez: openstack: networktests: enable IPv6 tests on eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1137785 (https://phabricator.wikimedia.org/T391325) [15:41:17] RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2114 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [15:41:21] (03CR) 10CI reject: [V:04-1] openstack: networktests: enable IPv6 tests on eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1137785 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez) [15:42:04] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye [15:42:13] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 13Patch-For-Review: Service implementation for cloudcephosd2004-dev - https://phabricator.wikimedia.org/T392366#10757237 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd2004-dev.codfw.wmnet w... [15:42:41] RESOLVED: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:42:49] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2115 is OK: OK - All good! 
https://wikitech.wikimedia.org/wiki/Search%23Administration [15:42:52] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2064 to cirrussearch2064 - bking@cumin2002" [15:43:08] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2064 to cirrussearch2064 - bking@cumin2002" [15:43:08] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:43:09] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2064 [15:43:19] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2064 [15:43:59] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2064 to cirrussearch2064 [15:45:59] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2057 [15:46:01] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: enable IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174) [15:46:12] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [15:47:18] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2045.codfw.wmnet with OS bookworm [15:47:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10757256 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm executed with err... 
[15:47:38] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2064.codfw.wmnet with OS bullseye [15:47:51] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2064 [15:48:01] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:50:24] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: enable IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174) [15:50:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:50:49] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: enable IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174) [15:52:11] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2057\.codfw\.wmnet [15:53:22] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2064 - bking@cumin2002" [15:53:27] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2064 - bking@cumin2002" [15:53:28] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:53:28] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2064.codfw.wmnet 109.16.192.10.in-addr.arpa 9.0.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:53:32] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2064.codfw.wmnet 109.16.192.10.in-addr.arpa 9.0.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:53:32] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2064 [15:53:40] FIRING: SystemdUnitFailed: 
waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:42] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2064 [15:53:42] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2064 [15:58:04] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: enable IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174) [15:58:22] (03PS5) 10Arturo Borrero Gonzalez: cloudgw: enable IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174) [15:58:29] (03CR) 10Andrew Bogott: [C:03+2] Revert "Revert "cloudcephosd2004-dev: put into service"" [puppet] - 10https://gerrit.wikimedia.org/r/1137764 (https://phabricator.wikimedia.org/T392366) (owner: 10Andrew Bogott) [15:58:38] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [15:58:39] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2056\.codfw\.wmnet [15:58:41] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2057\.codfw\.wmnet [15:58:43] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2058\.codfw\.wmnet [15:58:45] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2059\.codfw\.wmnet [15:59:36] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2060\.codfw\.wmnet [15:59:39] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2062\.codfw\.wmnet [15:59:41] !log bking@cumin2002 conftool 
action : set/weight=10:pooled=yes; selector: name=cirrussearch2063\.codfw\.wmnet [15:59:43] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2065\.codfw\.wmnet [15:59:46] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2066\.codfw\.wmnet [15:59:48] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2067\.codfw\.wmnet [15:59:51] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2068\.codfw\.wmnet [15:59:53] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2069\.codfw\.wmnet [15:59:55] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2070\.codfw\.wmnet [15:59:57] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2072\.codfw\.wmnet [16:00:00] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2074\.codfw\.wmnet [16:00:02] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2075\.codfw\.wmnet [16:00:05] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2077\.codfw\.wmnet [16:00:07] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2079\.codfw\.wmnet [16:00:10] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2082\.codfw\.wmnet [16:00:12] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2085\.codfw\.wmnet [16:00:14] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2087\.codfw\.wmnet [16:00:17] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2088\.codfw\.wmnet [16:00:19] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: 
name=cirrussearch2089\.codfw\.wmnet [16:00:22] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2090\.codfw\.wmnet [16:00:27] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2091\.codfw\.wmnet [16:00:30] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2097\.codfw\.wmnet [16:00:48] sorry for the spam, just realized it was logging each one [16:00:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:03:44] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch2103.codfw.wmnet|cirrussearch2104.codfw.wmnet|cirrussearch2105.codfw.wmnet|cirrussearch2107.codfw.wmnet|cirrussearch2109.codfw.wmnet|cirrussearch2111.codfw.wmnet|cirrussearch2112.codfw.wmnet|cirrussearch2114.codfw.wmnet|cirrussearch2115.codfw.wmnet [16:05:18] (03PS1) 10Andrew Bogott: cloudcephosd2004-dev: update nic names [puppet] - 10https://gerrit.wikimedia.org/r/1137798 (https://phabricator.wikimedia.org/T392366) [16:05:54] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [16:06:16] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd2004-dev: update nic names [puppet] - 10https://gerrit.wikimedia.org/r/1137798 (https://phabricator.wikimedia.org/T392366) (owner: 10Andrew Bogott) [16:09:03] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2064.codfw.wmnet with reason: host reimage [16:11:35] !log eevans@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1030.eqiad.wmnet with reason: Decommissioning — T378725 [16:11:36] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2089 is OK: OK - All good! 
https://wikitech.wikimedia.org/wiki/Search%23Administration [16:11:38] T378725: Refresh aqs1013 w/ aqs1022 - https://phabricator.wikimedia.org/T378725 [16:12:25] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2064.codfw.wmnet with reason: host reimage [16:13:31] !log decommissioning Cassandra/restbase1030-{a,b,c} — T389423 [16:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:35] T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423 [16:14:42] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2068 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [16:20:26] PROBLEM - Hadoop NodeManager on an-worker1204 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:21:44] RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2059 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [16:22:28] PROBLEM - Hadoop NodeManager on an-worker1155 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:23:45] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:25:26] RECOVERY - Hadoop NodeManager on an-worker1204 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:28:14] PROBLEM - Hadoop NodeManager on an-worker1191 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:08] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2064.codfw.wmnet with OS bullseye [16:38:28] RECOVERY - Hadoop NodeManager on an-worker1155 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:40:36] RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2088 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [16:42:38] PROBLEM - Disk space on an-worker1139 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 158815 MB (4% inode=99%): /var/lib/hadoop/data/f 156640 MB (4% inode=99%): /var/lib/hadoop/data/j 158068 MB (4% inode=99%): /var/lib/hadoop/data/m 154727 MB (4% inode=99%): /var/lib/hadoop/data/h 157595 MB (4% inode=99%): /var/lib/hadoop/data/k 159019 MB (4% inode=99%): /var/lib/hadoop/data/e 159428 MB (4% inode=99%): /var/lib/hadoop/data [16:42:38] 3 MB (5% inode=99%): /var/lib/hadoop/data/b 154444 MB (4% inode=99%): /var/lib/hadoop/data/d 153665 MB (4% inode=99%): /var/lib/hadoop/data/i 154340 MB (4% inode=99%): /var/lib/hadoop/data/l 144922 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1139&var-datasource=eqiad+prometheus/ops [16:49:14] RECOVERY - Hadoop NodeManager on an-worker1191 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:00] (03PS1) 10Andrew Bogott: invisible-unicorn: Return 404 if caller tries to access a nonexistent proxy [puppet] - 10https://gerrit.wikimedia.org/r/1137803 [16:52:50] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2075 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [16:57:20] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2066 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T1700) [17:00:05] ryankemper: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T1700). [17:03:40] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:12:44] (03CR) 10Cwhite: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1137272 (https://phabricator.wikimedia.org/T391392) (owner: 10Gehel) [17:18:02] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2094 to cirrussearch2094 [17:18:25] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:19:04] (03CR) 10Dzahn: "does not affect hosts serving traffic - the compiler failure isn't real - it's a case of "only works with the change" https://puppet-compi" [puppet] - 10https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [17:20:07] (03CR) 10Dzahn: [C:03+2] phabricator::migration: add scap::target, add deploy scripts, rm symlink [puppet] - 10https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [17:22:30] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2094 to cirrussearch2094 - bking@cumin2002" [17:22:59] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2094 to cirrussearch2094 - bking@cumin2002" [17:22:59] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:23:00] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2094 [17:23:18] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2094 [17:23:58] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2094 to cirrussearch2094 [17:24:34] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2094.codfw.wmnet with OS bullseye [17:24:45] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2094 [17:24:50] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:24:53] !log jhancock@cumin2002 START - 
Cookbook sre.hosts.reimage for host ganeti2050.codfw.wmnet with OS bookworm [17:24:59] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10757424 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2050.codfw.wmnet with OS bookworm [17:30:29] bking@cumin2002 reimage (PID 3444540) is awaiting input [17:32:18] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:34:06] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2085 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [17:35:18] PROBLEM - Host msw1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:36:20] PROBLEM - Host ps1-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:36:20] PROBLEM - Host ps1-a6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:36:20] PROBLEM - Host ps1-a7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:36:24] RECOVERY - Host msw1-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.62 ms [17:36:26] RECOVERY - Host ps1-a6-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.86 ms [17:36:28] RECOVERY - Host ps1-a1-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.15 ms [17:36:28] RECOVERY - Host ps1-a7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.03 ms [17:36:29] FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:38:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2050.codfw.wmnet with reason: host reimage [17:41:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
ganeti2050.codfw.wmnet with reason: host reimage [17:42:31] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2094 - bking@cumin2002" [17:42:36] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2094 - bking@cumin2002" [17:42:36] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:42:37] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2094.codfw.wmnet 230.16.192.10.in-addr.arpa 0.3.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:42:40] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2094.codfw.wmnet 230.16.192.10.in-addr.arpa 0.3.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:42:41] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2094 [17:42:52] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2094 [17:42:53] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2094 [17:46:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:47:19] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:49:05] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1137272 (https://phabricator.wikimedia.org/T391392) (owner: 10Gehel) [17:51:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:55:57] PROBLEM - Hadoop NodeManager on an-worker1187 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:56:57] PROBLEM - Hadoop NodeManager on an-worker1189 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:57:05] RECOVERY - ElasticSearch unassigned shard check - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [17:57:23] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:59:20] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2094.codfw.wmnet with reason: host reimage [18:00:13] RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2070 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [18:00:29] jhancock@cumin2002 reimage (PID 3444753) is awaiting input [18:00:29] RECOVERY - ElasticSearch unassigned shard check - 9643 on search.svc.codfw.wmnet is OK: OK - All good! 
https://wikitech.wikimedia.org/wiki/Search%23Administration [18:02:13] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2094.codfw.wmnet with reason: host reimage [18:04:56] RECOVERY - Hadoop NodeManager on an-worker1189 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:05:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:05:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2050.codfw.wmnet with OS bookworm [18:05:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10757481 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2050.codfw.wmnet with OS bookworm completed: - gane... [18:10:30] RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2067 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [18:18:00] (03CR) 10Majavah: [C:04-1] invisible-unicorn: Return 404 if caller tries to access a nonexistent proxy (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1137803 (owner: 10Andrew Bogott) [18:19:58] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10757498 (10Jhancock.wm) figured it out. 
gonna finish the rest this evening :fingers-crossed: [18:20:11] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10757501 (Jhancock.wm) [18:22:01] (PS2) Andrew Bogott: invisible-unicorn: Return 404 if caller tries to access a nonexistent proxy [puppet] - https://gerrit.wikimedia.org/r/1137803 [18:22:33] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2094.codfw.wmnet with OS bullseye [18:22:56] RECOVERY - Hadoop NodeManager on an-worker1187 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:26:40] (PS4) DDesouza: Design Research Participant Survey: Deploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325) [18:27:23] (PS1) Jforrester: ZString: Don't explode if we're handed an array with odd contents [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1137813 (https://phabricator.wikimedia.org/T392370) [18:32:41] (PS5) DDesouza: Design Research Participant Survey: Deploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325) [18:32:49] (CR) CI reject: [V:-1] Design Research Participant Survey: Deploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza) [18:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:35:46] (PS6) DDesouza: Design Research Participant Survey: Deploy [mediawiki-config] -
https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325) [19:08:59] (PS1) Jdrewniak: Create EventStream configuration for PES1.3 Wikirun Game [mediawiki-config] - https://gerrit.wikimedia.org/r/1137816 [19:23:42] PROBLEM - Hadoop NodeManager on an-worker1194 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:33:07] (CR) Bearloga: Create EventStream configuration for PES1.3 Wikirun Game (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1137816 (owner: Jdrewniak) [19:39:44] (PS2) Jdrewniak: Create EventStream configuration for PES1.3 Wikirun Game [mediawiki-config] - https://gerrit.wikimedia.org/r/1137816 [19:40:24] (PS1) Bernard Wang: Enable reading list beta feature for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) [19:40:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:40:53] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) (owner: Bernard Wang) [19:40:55] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) (owner: Bernard Wang) [19:44:20] RECOVERY - OpenSearch unassigned
shard check - 9400 on cirrussearch2111 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [19:45:13] (PS1) Dzahn: scap: stop hardcoding scap user home to fix puppet breakage [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) [19:46:01] (PS2) Dzahn: scap: stop hardcoding scap user home to fix puppet breakage [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) [19:46:25] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137816 (owner: Jdrewniak) [19:47:04] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2095 to cirrussearch2095 [19:47:26] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:48:16] (CR) LorenMora: [C:+1] Enable reading list beta feature for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) (owner: Bernard Wang) [19:48:50] (CR) Dillon: [C:+1] Enable reading list beta feature for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) (owner: Bernard Wang) [19:49:42] RECOVERY - Hadoop NodeManager on an-worker1194 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:51:33] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2095 to cirrussearch2095 - bking@cumin2002" [19:51:52] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by
cookbooks.sre.dns.netbox: Renaming elastic2095 to cirrussearch2095 - bking@cumin2002" [19:51:53] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:51:53] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2095 [19:52:03] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2095 [19:52:43] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2095 to cirrussearch2095 [19:52:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:53:14] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2095.codfw.wmnet with OS bullseye [19:53:26] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2095 [19:53:31] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:53:41] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:56:42] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2109 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [19:57:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:58:10] (03PS4) 10Bking: cirrussearch: prepare for eqiad 
migration [puppet] - 10https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) [19:58:14] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [19:59:49] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2095 - bking@cumin2002" [19:59:55] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2095 - bking@cumin2002" [19:59:55] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:59:56] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2095.codfw.wmnet 232.16.192.10.in-addr.arpa 2.3.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:59:59] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2095.codfw.wmnet 232.16.192.10.in-addr.arpa 2.3.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [20:00:00] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2095 [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T2000). [20:00:04] danisztls, bwang, and jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:09] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2095 [20:00:09] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2095 [20:01:08] o/ [20:01:52] o/ [20:02:05] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 113, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:02:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:03:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:04:19] RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2090 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [20:06:17] Hello! Here for the window [20:06:32] hey danisztls , bwang , looks like it's just config changes, I can do the deploy today :) [20:08:05] I'm going to do all three at once, I don't think there's a dependency between any of them. 
[20:09:25] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2110 to cirrussearch2110 [20:09:40] (CR) TrainBranchBot: [C:+2] "Approved by jdrewniak@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza) [20:09:40] (CR) TrainBranchBot: [C:+2] "Approved by jdrewniak@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) (owner: Bernard Wang) [20:09:41] (CR) TrainBranchBot: [C:+2] "Approved by jdrewniak@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137816 (owner: Jdrewniak) [20:10:07] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:10:13] RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2063 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [20:10:32] (Merged) jenkins-bot: Design Research Participant Survey: Deploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza) [20:10:33] bwang: is the patch on the schedule just duplicated, or should there be a different patch there too? 
https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T2000 [20:10:35] (Merged) jenkins-bot: Enable reading list beta feature for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) (owner: Bernard Wang) [20:10:42] (Merged) jenkins-bot: Create EventStream configuration for PES1.3 Wikirun Game [mediawiki-config] - https://gerrit.wikimedia.org/r/1137816 (owner: Jdrewniak) [20:10:57] !log jdrewniak@deploy1003 Started scap sync-world: Backport for [[gerrit:1137568|Design Research Participant Survey: Deploy (T392325)]], [[gerrit:1137817|Enable reading list beta feature for beta cluster (T390881)]], [[gerrit:1137816|Create EventStream configuration for PES1.3 Wikirun Game]] [20:11:02] T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325 [20:11:02] T390881: Enable extension:ReadingList as beta feature on beta cluster - https://phabricator.wikimedia.org/T390881 [20:12:52] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1137818/5323/" [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn) [20:14:12] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2110 to cirrussearch2110 - bking@cumin2002" [20:14:57] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2110 to cirrussearch2110 - bking@cumin2002" [20:14:57] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:14:58] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2110 [20:15:23] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host 
cirrussearch2110 [20:15:43] !log jdrewniak@deploy1003 jdrewniak, dani, bwang: Backport for [[gerrit:1137568|Design Research Participant Survey: Deploy (T392325)]], [[gerrit:1137817|Enable reading list beta feature for beta cluster (T390881)]], [[gerrit:1137816|Create EventStream configuration for PES1.3 Wikirun Game]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:16:04] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2110 to cirrussearch2110 [20:16:30] danisztls, bwang changes are ready to test on mwdebug [20:17:02] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2095.codfw.wmnet with reason: host reimage [20:18:12] jan_drewniak: looks good [20:18:17] Ok checking [20:20:03] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2110.codfw.wmnet with OS bullseye [20:20:11] ops-eqiad, SRE, DC-Ops: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10757776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cirrussearch2110.codfw.w... [20:20:20] I think I'm having issues with my wikimediadebug extension... 
[20:20:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2095.codfw.wmnet with reason: host reimage [20:22:59] !log jdrewniak@deploy1003 jdrewniak, dani, bwang: Continuing with sync [20:23:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:25:01] (PS5) Bking: cirrussearch: prepare for eqiad migration [puppet] - https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) [20:26:56] (PS6) Bking: cirrussearch: prepare for eqiad migration [puppet] - https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) [20:27:07] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) (owner: Bking) [20:29:41] !log jdrewniak@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137568|Design Research Participant Survey: Deploy (T392325)]], [[gerrit:1137817|Enable reading list beta feature for beta cluster (T390881)]], [[gerrit:1137816|Create EventStream configuration for PES1.3 Wikirun Game]] (duration: 18m 44s) [20:29:46] T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325 [20:29:47] T390881: Enable extension:ReadingList as beta feature on beta cluster - https://phabricator.wikimedia.org/T390881 [20:31:46] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2082 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [20:32:55] Ok backport sync done [20:33:26] jan_drewniak: thanks! 
[20:36:02] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2110.codfw.wmnet with reason: host reimage [20:38:53] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2110.codfw.wmnet with reason: host reimage [20:40:28] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2095.codfw.wmnet with OS bullseye [20:51:37] RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2105 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [20:59:28] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2110.codfw.wmnet with OS bullseye [20:59:33] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2058 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [20:59:34] ops-eqiad, SRE, DC-Ops: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10757835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cirrussearch2110.codfw.wmnet... [21:00:04] Reedy, sbassett, Maryum, and manfredi: Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T2100). Please do the needful. 
[21:03:23] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2096 to cirrussearch2096 [21:03:40] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:03:45] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:08:05] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2096 to cirrussearch2096 - bking@cumin2002" [21:11:10] bking@cumin2002 rename (PID 3675984) is awaiting input [21:11:11] RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2077 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [21:14:32] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2096 to cirrussearch2096 - bking@cumin2002" [21:14:32] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:14:33] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2096 [21:14:42] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2096 [21:15:22] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2096 to cirrussearch2096 [21:23:10] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2071 to cirrussearch2071 [21:23:33] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:27:48] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2071 to cirrussearch2071 - bking@cumin2002" [21:28:35] !log bking@cumin2002 END 
(PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2071 to cirrussearch2071 - bking@cumin2002" [21:28:35] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:28:36] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2071 [21:28:55] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2071 [21:29:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2071 to cirrussearch2071 [21:31:25] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2071.codfw.wmnet with OS bullseye [21:31:37] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2071 [21:31:46] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:33:19] RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2056 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [21:34:20] (PS1) Ryan Kemper: rolling-operation: (proof of concept) manually output commands [cookbooks] - https://gerrit.wikimedia.org/r/1137824 [21:35:49] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2071 - bking@cumin2002" [21:35:54] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2071 - bking@cumin2002" [21:35:55] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:35:55] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2071.codfw.wmnet 70.32.192.10.in-addr.arpa 0.7.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:35:59] !log bking@cumin2002 
END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2071.codfw.wmnet 70.32.192.10.in-addr.arpa 0.7.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:35:59] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2071 [21:36:20] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2071 [21:36:20] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2071 [21:36:29] FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:37:09] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:37:38] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (6 nodes at a time) for ElasticSearch cluster search_codfw: test manual mode - ryankemper@cumin2002 - T388610 [21:37:42] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [21:37:43] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (6 nodes at a time) for ElasticSearch cluster search_codfw: test manual mode - ryankemper@cumin2002 - T388610 [21:38:52] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (6 nodes at a time) for ElasticSearch cluster search_codfw: test manual mode - ryankemper@cumin2002 - T388610 [21:41:40] (CR) CI reject: [V:-1] rolling-operation: (proof of concept) manually output commands [cookbooks] - https://gerrit.wikimedia.org/r/1137824 (owner: Ryan Kemper) [21:44:21] PROBLEM - 
PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:46:40] (CR) Dzahn: "This change (or one of the other 2 that were merged at the same time a little while ago today) seems to have broken beta-scap-sync-world. " [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) (owner: Bernard Wang) [21:46:44] (CR) Dzahn: "This change (or one of the other 2 that were merged at the same time a little while ago today) seems to have broken beta-scap-sync-world. " [mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza) [21:46:49] (CR) Dzahn: "This change (or one of the other 2 that were merged at the same time a little while ago today) seems to have broken beta-scap-sync-world. " [mediawiki-config] - https://gerrit.wikimedia.org/r/1137816 (owner: Jdrewniak) [21:51:22] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2071.codfw.wmnet with reason: host reimage [21:55:10] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2071.codfw.wmnet with reason: host reimage [21:55:55] (CR) Dzahn: "in the scap::target class, the relevant line is "home => "/var/lib/${deploy_user}". This is the case if $manage_user is set to true." 
[puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn) [21:57:12] (PS3) Dzahn: scap: stop hardcoding scap user home to fix puppet breakage [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) [21:59:03] (CR) CI reject: [V:-1] scap: stop hardcoding scap user home to fix puppet breakage [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn) [22:03:53] (PS4) Dzahn: scap: stop hardcoding scap user home to fix puppet breakage [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) [22:05:45] (CR) CI reject: [V:-1] scap: stop hardcoding scap user home to fix puppet breakage [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn) [22:06:52] (PS5) Dzahn: scap: stop hardcoding scap user home to fix puppet breakage [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) [22:08:28] (CR) Dzahn: "puppet compiler run linked above ran on "C:scap" so it picked one host for each regex in site.pp with roles that include scap and showed n" [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn) [22:11:25] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2103 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [22:12:47] (PS1) Dzahn: phabricator/scap: disable scap bootstrapping on phab1005 [puppet] - https://gerrit.wikimedia.org/r/1137827 (https://phabricator.wikimedia.org/T377889) [22:15:31] (CR) DDesouza: "Sorry about that though I don't think this config change is the issue. The code is similar to previous deployments." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza) [22:20:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2071.codfw.wmnet with OS bullseye [22:20:19] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2079 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [22:22:37] RECOVERY - Disk space on an-worker1139 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1139&var-datasource=eqiad+prometheus/ops [22:30:22] (CR) Dzahn: "ACK! I just said that because it seemed they were updated in the same minute when I looked at the repo. Sorry as well for the noise. It's " [mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza) [22:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:35:36] (CR) Dzahn: [C:+2] "just to not leave puppet broken on a host in setup" [puppet] - https://gerrit.wikimedia.org/r/1137827 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn) [22:40:27] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2055 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [22:42:04] that alert on crm2001 is both known and weird. known because it's https://phabricator.wikimedia.org/T383715 and WIP but also weird because it links to alerts.wikimedia.org where it does not show up.. it's silenced yet still talks on IRC.. 
[22:44:56] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (6 nodes at a time) for ElasticSearch cluster search_codfw: test manual mode - ryankemper@cumin2002 - T388610 [22:45:00] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [22:45:52] (CR) Dzahn: [C:+2] "removed errors about bootstrapping but we still have "Package[phabricator/deployment]: Provider scap3 is not functional on this host"" [puppet] - https://gerrit.wikimedia.org/r/1137827 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn) [22:47:54] (PS1) Dzahn: phabricator: comment out scap::target in migration class [puppet] - https://gerrit.wikimedia.org/r/1137830 (https://phabricator.wikimedia.org/T377889) [22:49:05] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2060 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [22:52:29] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/output/1137830/5326/phab1005.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1137830 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn) [22:52:30] (CR) Dzahn: [C:+2] phabricator: comment out scap::target in migration class [puppet] - https://gerrit.wikimedia.org/r/1137830 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn) [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T2300) [23:22:31] RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2087 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [23:39:51] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1137835 [23:39:51] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
https://gerrit.wikimedia.org/r/1137835 (owner: TrainBranchBot) [23:40:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:49:26] jouncebot: nowandnext [23:49:27] For the next 0 hour(s) and 10 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T2300) [23:49:27] In 2 hour(s) and 10 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250422T0200) [23:50:40] (PS1) Reedy: InitialiseSettings-labs.php: Fix ReadingList config [mediawiki-config] - https://gerrit.wikimedia.org/r/1137837 [23:51:16] (CR) Reedy: Enable reading list beta feature for beta cluster (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) (owner: Bernard Wang) [23:51:34] (CR) Reedy: [C:+2] InitialiseSettings-labs.php: Fix ReadingList config [mediawiki-config] - https://gerrit.wikimedia.org/r/1137837 (owner: Reedy) [23:52:01] RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2112 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [23:52:10] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1137835 (owner: TrainBranchBot) [23:52:21] (Merged) jenkins-bot: InitialiseSettings-labs.php: Fix ReadingList config [mediawiki-config] - https://gerrit.wikimedia.org/r/1137837 (owner: Reedy) [23:53:41] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:53:53] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2069 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [23:58:11] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2072 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration