[00:09:44] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1137570
[00:09:44] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1137570 (owner: TrainBranchBot)
[00:10:51] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:22:45] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2097-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[00:23:40] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[00:24:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2097:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[00:25:17] <icinga-wm>	 PROBLEM - Disk space on analytics1070 is CRITICAL: DISK CRITICAL - free space: / 2121 MB (3% inode=95%): /tmp 2121 MB (3% inode=95%): /var/tmp 2121 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1070&var-datasource=eqiad+prometheus/ops
[00:30:46] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1137570 (owner: TrainBranchBot)
[00:51:37] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/9f101cedcb0befe856cdbcefb8a70401e9822d3e23a7ba6b169df3aa01fa0155/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:03:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:08:35] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2062 is CRITICAL: CRITICAL - nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:10:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:11:37] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:20:25] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:23:21] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:36:29] <jinxer-wm>	 FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:52:27] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:53:23] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:12:51] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:17:01] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2065 is CRITICAL: CRITICAL - bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), ukwiktionary_general_1727979466[0](2025-04-17T22:2
[02:17:01] <icinga-wm>	 Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:20:27] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2104 is CRITICAL: CRITICAL - ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), kuwiki_general_1728075589[0](2025-04-17T22:2
[02:20:27] <icinga-wm>	 Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:35:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:44:41] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2107 is CRITICAL: CRITICAL - nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[03:03:39] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:03:47] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:05:07] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:05:31] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53801 bytes in 2.442 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:05:37] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:05:57] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:10:21] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2074 is CRITICAL: CRITICAL - mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:2
[03:10:21] <icinga-wm>	 Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[03:23:43] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2114 is CRITICAL: CRITICAL - cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), ukwiktionary_general_1727979466[0](2025-04-17T22:2
[03:23:43] <icinga-wm>	 Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[03:42:49] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2115 is CRITICAL: CRITICAL - brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[03:53:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:11:35] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2089 is CRITICAL: CRITICAL - lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:14:41] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2068 is CRITICAL: CRITICAL - brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:21:43] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2059 is CRITICAL: CRITICAL - kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), mniwiktionary_content_1728016501[0](2025-04-17T22:2
[04:21:43] <icinga-wm>	 Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:22:45] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2097-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[04:23:40] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[04:24:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2097:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:40:35] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2088 is CRITICAL: CRITICAL - bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), cowiki_general_1728063266[0](2025-04-17T22:2
[04:40:35] <icinga-wm>	 Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:50:21] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:51:21] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:52:51] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2075 is CRITICAL: CRITICAL - brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:57:19] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2066 is CRITICAL: CRITICAL - brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:03:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:10:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:11:39] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:11:47] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:13:07] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:15:01] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:18:29] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:18:37] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.203 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:34:05] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2085 is CRITICAL: CRITICAL - brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:36:29] <jinxer-wm>	 FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:54:23] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:54:27] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:55:27] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:56:21] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:57:07] <icinga-wm>	 PROBLEM - ElasticSearch unassigned shard check - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), bat_smgwiki_general_1728063421[0](2025-0
[05:57:07] <icinga-wm>	 21:16.832Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:00:13] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2070 is CRITICAL: CRITICAL - cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), mniwiktionary_content_1728016501[0](2025-04-17T22:2
[06:00:13] <icinga-wm>	 Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:00:29] <icinga-wm>	 PROBLEM - ElasticSearch unassigned shard check - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:10:31] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2067 is CRITICAL: CRITICAL - bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), cywikibooks_content_1728117258[0](2025-04-17T22:2
[06:10:31] <icinga-wm>	 Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:35:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T0700).
[07:00:05] <jouncebot>	 robertsky and Aca: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:04] <Aca>	 *waves*
[07:17:01] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137510 (https://phabricator.wikimedia.org/T392334) (owner: Acamicamacaraca)
[07:17:36] <Aca>	 rescheduling for the late backport window
[07:18:01] <Aca>	 see ya then
[07:30:21] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:31:21] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:44:19] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2111 is CRITICAL: CRITICAL - mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), ukwiktionary_general_1727979466[0](2025-04-17T22:2
[07:44:19] <icinga-wm>	 Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:53:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:56:41] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2109 is CRITICAL: CRITICAL - lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z), brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:00:12] <robertsky>	 Aca: yeah. apologies. was up to my neck in another matter.
[08:04:19] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2090 is CRITICAL: CRITICAL - bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), mniwiktionary_content_1728016501[0](2025-04-17T22:2
[08:04:19] <icinga-wm>	 Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:10:13] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2063 is CRITICAL: CRITICAL - cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:2
[08:10:13] <icinga-wm>	 Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:22:45] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2097-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[08:23:45] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[08:24:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2097:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:31:17] <wikibugs>	 (CR) Federico Ceratto: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1137285 (owner: Volans)
[08:31:45] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2082 is CRITICAL: CRITICAL - brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:51:37] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2105 is CRITICAL: CRITICAL - cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), mniwiktionary_content_1728016501[0](2025-04-17T22:21:20.033Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), kuwiki_general_1728075589[0](2025-04-17T22:2
[08:51:37] <icinga-wm>	 Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:59:33] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2058 is CRITICAL: CRITICAL - brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:03:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:10:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:11:11] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2077 is CRITICAL: CRITICAL - cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), mniwiktionary_content_1728016501[0](2025-04-17T22:2
[09:11:11] <icinga-wm>	 Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:15:49] <icinga-wm>	 PROBLEM - Disk space on analytics1072 is CRITICAL: DISK CRITICAL - free space: / 2121 MB (3% inode=95%): /tmp 2121 MB (3% inode=95%): /var/tmp 2121 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops
[09:22:33] <wikibugs>	 (PS3) Majavah: prometheus: cloudvirt-libvirt-stats: Ignore file paths as well [puppet] - https://gerrit.wikimedia.org/r/1137218 (https://phabricator.wikimedia.org/T289563)
[09:28:12] <wikibugs>	 (PS1) Majavah: P:toolforge: redis_sentinel: Fix non-breaking spaces [puppet] - https://gerrit.wikimedia.org/r/1137726
[09:28:13] <wikibugs>	 (PS1) Majavah: P:toolforge: redis_sentinel: Don't try to set client name [puppet] - https://gerrit.wikimedia.org/r/1137727 (https://phabricator.wikimedia.org/T366471)
[09:29:27] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:29:57] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5321/co" [puppet] - https://gerrit.wikimedia.org/r/1137727 (https://phabricator.wikimedia.org/T366471) (owner: Majavah)
[09:30:23] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:33:19] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch2056 is CRITICAL: CRITICAL - cowiki_general_1728063266[0](2025-04-17T22:21:16.304Z), kuwiki_general_1728075589[0](2025-04-17T22:22:45.143Z), cywikibooks_content_1728117258[0](2025-04-17T22:21:07.230Z), ukwiktionary_general_1727979466[0](2025-04-17T22:21:35.775Z), bat_smgwiki_general_1728063421[0](2025-04-17T22:21:16.832Z), mniwiktionary_content_1728016501[0](2025-04-17T22:2
[09:33:19] <icinga-wm>	 Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:36:29] <jinxer-wm>	 FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:56:11] <wikibugs>	 (PS2) Vgutierrez: wmflib: Add list_secrets() function [puppet] - https://gerrit.wikimedia.org/r/1137055 (https://phabricator.wikimedia.org/T391411)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T1000)
[10:11:25] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2103 is CRITICAL: CRITICAL - nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z), brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:12:21] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:14:19] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:20:19] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9600 on cirrussearch2079 is CRITICAL: CRITICAL - nlwikimedia_general_1728077255[0](2025-04-17T22:21:15.926Z), brwiktionary_content_1727984886[0](2025-04-17T22:21:31.970Z), lvwikibooks_general_1728069176[0](2025-04-17T22:21:08.499Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:35:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:48:24] <wikibugs>	 (PS1) Majavah: hieradata: Drop old cloudinfra cumin hosts [puppet] - https://gerrit.wikimedia.org/r/1137728 (https://phabricator.wikimedia.org/T367725)
[10:50:09] <wikibugs>	 (PS1) Majavah: Remove root keys for former staff [labs/private] - https://gerrit.wikimedia.org/r/1137729
[10:50:33] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [labs/private] - https://gerrit.wikimedia.org/r/1137729 (owner: Majavah)
[10:51:53] <wikibugs>	 (CR) Majavah: [V:+2 C:+2] Remove root keys for former staff [labs/private] - https://gerrit.wikimedia.org/r/1137729 (owner: Majavah)
[10:57:41] <wikibugs>	 (PS1) Majavah: hieradata: cloudinfra: Drop obsolete keys [puppet] - https://gerrit.wikimedia.org/r/1137730
[11:03:59] <wikibugs>	 SRE, Infrastructure-Foundations: LVS: Error with Netbox PuppetDB import script after device moved to Liberica and upgraded - https://phabricator.wikimedia.org/T388770#10756789 (taavi)
[11:13:59] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1137730 (owner: Majavah)
[11:14:19] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: cloudinfra: Drop obsolete keys [puppet] - https://gerrit.wikimedia.org/r/1137730 (owner: Majavah)
[11:24:16] <wikibugs>	 (PS1) Majavah: Add WMCS v6 range to relevant exclusions [mediawiki-config] - https://gerrit.wikimedia.org/r/1137731 (https://phabricator.wikimedia.org/T386689)
[11:46:50] <wikibugs>	 (PS1) Majavah: Add WMCS ranges to wgAutoblockExemptions [mediawiki-config] - https://gerrit.wikimedia.org/r/1137732 (https://phabricator.wikimedia.org/T386689)
[11:53:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:22:45] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2097-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[12:23:45] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[12:24:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2097:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[12:48:04] <wikibugs>	 (PS1) Majavah: hieradata: puppet-compiler: Drop obsolete key [puppet] - https://gerrit.wikimedia.org/r/1137747
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T1300).
[13:00:05] <jouncebot>	 danisztls, robertsky, and Aca: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:24] <Aca>	 👋 waves
[13:00:27] <robertsky>	 I am here
[13:01:27] <danisztls>	 o/
[13:03:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:03:47] <taavi>	 i guess i can deploy
[13:03:56] <robertsky>	 yay
[13:04:00] <Aca>	 awesome
[13:04:01] <robertsky>	 :)
[13:05:34] <taavi>	 danisztls: is there intentionally a space at the end of the question name?
[13:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:07:27] <taavi>	 deploying the other two in the meantime
[13:07:29] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by taavi@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137349 (https://phabricator.wikimedia.org/T392239) (owner: Robertsky)
[13:07:30] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by taavi@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137510 (https://phabricator.wikimedia.org/T392334) (owner: Acamicamacaraca)
[13:07:59] <robertsky>	 standing by
[13:08:17] <wikibugs>	 (Merged) jenkins-bot: wikimaniawiki: update logo to 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1137349 (https://phabricator.wikimedia.org/T392239) (owner: Robertsky)
[13:08:20] <wikibugs>	 (Merged) jenkins-bot: Enable mobile sitenotice for shwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1137510 (https://phabricator.wikimedia.org/T392334) (owner: Acamicamacaraca)
[13:08:33] <wikibugs>	 (PS3) DDesouza: Design Research Participant Survey: Pre-deploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1137567 (https://phabricator.wikimedia.org/T392325)
[13:08:35] <danisztls>	 taavi: not intentional, fixed
[13:08:52] <logmsgbot>	 !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1137349|wikimaniawiki: update logo to 2025 (T392239)]], [[gerrit:1137510|Enable mobile sitenotice for shwiki (T392334)]]
[13:08:58] <stashbot>	 T392239: wikimaniawiki: update to 2025 wordmark - https://phabricator.wikimedia.org/T392239
[13:08:58] <stashbot>	 T392334: Enable Sitenotice in mobile view on Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T392334
[13:09:20] <taavi>	 danisztls: thanks! will deploy yours once the current batch is out
[13:10:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:10:26] <danisztls>	 taavi: ok, thanks!
[13:15:56] <taavi>	 the image build is taking a while :/
[13:16:19] <robertsky>	 :o hope it turns out well though.
[13:18:48] <wikibugs>	 (CR) Majavah: [C:+1] invisible-unicorn: Delete dns entries before removing proxy records [puppet] - https://gerrit.wikimedia.org/r/1137483 (https://phabricator.wikimedia.org/T391718) (owner: Andrew Bogott)
[13:26:37] <taavi>	 still doing something..
[13:29:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2057 to cirrussearch2057
[13:30:08] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[13:31:48] <taavi>	 image build is finally complete, now it's continuing the deployment
[13:32:02] <robertsky>	 ok
[13:36:06] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2057 to cirrussearch2057 - bking@cumin2002"
[13:36:29] <jinxer-wm>	 FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:36:30] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10756944 (Jclark-ctr) No alerts for 4 days and temps and fan speeds have dropped  closing this ticket for Temp  The system inlet tempera...
[13:36:43] <logmsgbot>	 !log taavi@deploy1003 robertsky, taavi, aleksandar: Backport for [[gerrit:1137349|wikimaniawiki: update logo to 2025 (T392239)]], [[gerrit:1137510|Enable mobile sitenotice for shwiki (T392334)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:36:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2057 to cirrussearch2057 - bking@cumin2002"
[13:36:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:36:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2057
[13:36:47] <stashbot>	 T392239: wikimaniawiki: update to 2025 wordmark - https://phabricator.wikimedia.org/T392239
[13:36:48] <stashbot>	 T392334: Enable Sitenotice in mobile view on Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T392334
[13:36:53] <taavi>	 finally
[13:36:57] <taavi>	 Aca: robertsky: please test
[13:36:57] <Aca>	 checkin'
[13:37:17] <robertsky>	 checking
[13:37:22] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2057
[13:37:40] <robertsky>	 ok logo is updated
[13:37:57] <robertsky>	 all's good.
[13:38:02] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2057 to cirrussearch2057
[13:38:15] <Aca>	 work as intended, lgtm
[13:38:19] <Aca>	 works*
[13:38:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2057.codfw.wmnet with OS bullseye
[13:38:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2057
[13:39:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[13:40:22] <logmsgbot>	 !log taavi@deploy1003 robertsky, taavi, aleksandar: Continuing with sync
[13:43:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2057 - bking@cumin2002"
[13:43:28] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2057 - bking@cumin2002"
[13:43:28] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:43:28] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2057.codfw.wmnet 204.16.192.10.in-addr.arpa 4.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:43:32] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2057.codfw.wmnet 204.16.192.10.in-addr.arpa 4.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:43:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2057
[13:44:48] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2057
[13:44:49] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2057
[13:49:56] <logmsgbot>	 !log taavi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137349|wikimaniawiki: update logo to 2025 (T392239)]], [[gerrit:1137510|Enable mobile sitenotice for shwiki (T392334)]] (duration: 41m 04s)
[13:50:01] <stashbot>	 T392239: wikimaniawiki: update to 2025 wordmark - https://phabricator.wikimedia.org/T392239
[13:50:01] <stashbot>	 T392334: Enable Sitenotice in mobile view on Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T392334
[13:50:14] <taavi>	 there we go
[13:50:18] <taavi>	 jouncebot: nowandnext
[13:50:18] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T1300)
[13:50:18] <jouncebot>	 In 1 hour(s) and 39 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T1530)
[13:50:38] <taavi>	 danisztls: we're going to overrun the window, but I'm fine with that if you're still around
[13:51:46] <Aca>	 Thanks for the deploy! Will have to make it accessible now :)
[13:51:54] <robertsky>	 taavi: I think the logo needs a purge in prod per https://wikitech.wikimedia.org/wiki/Wikimedia_site_requests#Change_the_logo_of_a_Wikimedia_wiki
[13:52:15] <taavi>	 robertsky: right. one second
[13:53:05] <robertsky>	 https://en.wikipedia.org/static/images/mobile/copyright/wikimaniawiki-wordmark.svg
[13:53:07] <robertsky>	 the image ^
[13:53:28] <danisztls>	 taavi: thanks!
[13:53:48] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by taavi@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137567 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza)
[13:53:59] <taavi>	 !log taavi@deploy1003 ~ $ echo "https://en.wikipedia.org/static/images/mobile/copyright/wikimaniawiki-wordmark.svg" | mwscript-k8s --attach purgeList.php -- --wiki enwiki
[13:54:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:04] <taavi>	 robertsky: done!
[13:54:58] <robertsky>	 thanks.
[13:57:49] <robertsky>	 hmm... the logo looks small on some pages on wikimaniawiki.
[13:59:26] <taavi>	 and now even CI is taking a while :/
[14:00:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2057.codfw.wmnet with reason: host reimage
[14:02:38] <icinga-wm>	 PROBLEM - Disk space on an-worker1139 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 158035 MB (4% inode=99%): /var/lib/hadoop/data/f 157621 MB (4% inode=99%): /var/lib/hadoop/data/j 157146 MB (4% inode=99%): /var/lib/hadoop/data/m 155766 MB (4% inode=99%): /var/lib/hadoop/data/h 156682 MB (4% inode=99%): /var/lib/hadoop/data/k 158039 MB (4% inode=99%): /var/lib/hadoop/data/e 160049 MB (4% inode=99%): /var/lib/hadoop/data
[14:02:38] <icinga-wm>	 2 MB (5% inode=99%): /var/lib/hadoop/data/b 155120 MB (4% inode=99%): /var/lib/hadoop/data/d 149539 MB (3% inode=99%): /var/lib/hadoop/data/i 154128 MB (4% inode=99%): /var/lib/hadoop/data/l 151826 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1139&var-datasource=eqiad+prometheus/ops
[14:03:28] <wikibugs>	 ops-codfw, SRE, Cassandra, DC-Ops: restbase2035 is down - https://phabricator.wikimedia.org/T392243#10756968 (Jhancock.wm) Open→Resolved replacement DIMM received. defective returned.
[14:04:33] <wikibugs>	 (Merged) jenkins-bot: Design Research Participant Survey: Pre-deploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1137567 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza)
[14:04:33] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2057.codfw.wmnet with reason: host reimage
[14:04:46] <logmsgbot>	 !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1137567|Design Research Participant Survey: Pre-deploy (T392325)]]
[14:04:49] <stashbot>	 T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325
[14:08:31] <robertsky>	 taavi: any idea why the logo is downscaled for logged out visitors? the logo on 2025:Wikimania is at the right size if it is clicked through the sidebar's Main page link. https://postimg.cc/gallery/06qw6Kt/318ad600
[14:09:10] <logmsgbot>	 !log taavi@deploy1003 taavi, dani: Backport for [[gerrit:1137567|Design Research Participant Survey: Pre-deploy (T392325)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:09:12] <taavi>	 robertsky: i've no idea, sorry
[14:09:15] <taavi>	 danisztls: please test
[14:09:21] <taavi>	 (is there anything to test?)
[14:11:41] <danisztls>	 taavi: looks good
[14:12:02] <logmsgbot>	 !log taavi@deploy1003 taavi, dani: Continuing with sync
[14:12:05] <taavi>	 thanks, syncing
[14:14:26] <danisztls>	 taavi: thanks!
[14:16:43] <wikibugs>	 ops-codfw, SRE, cloud-services-team, DC-Ops: Service implementation for cloudcephosd2004-dev - https://phabricator.wikimedia.org/T392366 (Andrew) NEW
[14:17:01] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2065 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:18:17] <wikibugs>	 (PS1) Majavah: P:wmcs: cloudgw: Refuse outbound mail via NAT [puppet] - https://gerrit.wikimedia.org/r/1137757 (https://phabricator.wikimedia.org/T366936)
[14:18:19] <wikibugs>	 (PS1) Majavah: P:exim::smarthost: Convert unsupported domain warn to reject [puppet] - https://gerrit.wikimedia.org/r/1137758 (https://phabricator.wikimedia.org/T366935)
[14:18:35] <wikibugs>	 (PS1) Andrew Bogott: cloudcephosd2004-dev: put into service [puppet] - https://gerrit.wikimedia.org/r/1137759 (https://phabricator.wikimedia.org/T392366)
[14:19:17] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloudcephosd2004-dev: put into service [puppet] - https://gerrit.wikimedia.org/r/1137759 (https://phabricator.wikimedia.org/T392366) (owner: Andrew Bogott)
[14:19:39] <logmsgbot>	 !log taavi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137567|Design Research Participant Survey: Pre-deploy (T392325)]] (duration: 14m 53s)
[14:19:43] <stashbot>	 T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325
[14:20:27] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2104 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:22:14] <wikibugs>	 (PS1) Andrew Bogott: cloudcephosd2004-dev: network hiera setup [puppet] - https://gerrit.wikimedia.org/r/1137760 (https://phabricator.wikimedia.org/T392366)
[14:22:43] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloudcephosd2004-dev: network hiera setup [puppet] - https://gerrit.wikimedia.org/r/1137760 (https://phabricator.wikimedia.org/T392366) (owner: Andrew Bogott)
[14:23:49] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch2097 is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 59, number_of_data_nodes: 59, discovered_master: True, active_primary_shards: 1352, active_shards: 4179, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 3, delayed_unassigned_shards: 0, number_of_pending
[14:23:49] <icinga-wm>	 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 34897, active_shards_percent_as_number: 99.92826398852223 https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:23:49] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cirrussearch2097 is OK: OK - elasticsearch status production-search-omega-codfw: cluster_name: production-search-omega-codfw, status: yellow, timed_out: False, number_of_nodes: 30, number_of_data_nodes: 30, discovered_master: True, active_primary_shards: 1704, active_shards: 5088, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 23, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.54999021717863 https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:24:26] <wikibugs>	 (PS1) Andrew Bogott: cloudcephosd2004-dev: correct hiera typo in previous patch [puppet] - https://gerrit.wikimedia.org/r/1137761
[14:24:54] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloudcephosd2004-dev: correct hiera typo in previous patch [puppet] - https://gerrit.wikimedia.org/r/1137761 (owner: Andrew Bogott)
[14:25:01] <robertsky>	 taavi: got a theory. somehow the system is still holding on to the old dimensions of the previous wordmark.
[14:25:15] <robertsky>	 just not sure where to go from here.
[14:28:22] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye
[14:28:33] <wikibugs>	 ops-codfw, SRE, cloud-services-team, DC-Ops, Patch-For-Review: Service implementation for cloudcephosd2004-dev - https://phabricator.wikimedia.org/T392366#10757040 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd2004-dev.codfw.wmn...
[14:29:28] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2097:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[14:29:31] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2057.codfw.wmnet with OS bullseye
[14:30:09] <wikibugs>	 (PS1) Andrew Bogott: cloudcephosd2004-dev: force to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1137762
[14:31:34] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloudcephosd2004-dev: force to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1137762 (owner: Andrew Bogott)
[14:32:50] <robertsky>	 nvm. it looks ok now.
[14:33:29] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:34:25] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:35:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:42:39] <jinxer-wm>	 RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2097-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[14:44:41] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2107 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:45:55] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:46:19] <robertsky>	 taavi: around still? is the logo cached differently for the mobile site?
[14:46:27] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2004-dev.codfw.wmnet with reason: host reimage
[14:47:12] <robertsky>	 ah. nvm.
[14:47:24] <robertsky>	 I think my ISP is caching instead.
[14:49:44] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2004-dev.codfw.wmnet with reason: host reimage
[14:52:44] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[14:56:13] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=no; selector: name=elastic2078\.codfw\.wmnet
[14:57:32] <logmsgbot>	 !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye
[14:57:39] <wikibugs>	 ops-codfw, SRE, cloud-services-team, DC-Ops: Service implementation for cloudcephosd2004-dev - https://phabricator.wikimedia.org/T392366#10757083 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye execut...
[15:01:39] <wikibugs>	 (PS1) Andrew Bogott: Revert "cloudcephosd2004-dev: put into service" [puppet] - https://gerrit.wikimedia.org/r/1137763 (https://phabricator.wikimedia.org/T392366)
[15:01:41] <wikibugs>	 (PS1) Andrew Bogott: Revert "Revert "cloudcephosd2004-dev: put into service"" [puppet] - https://gerrit.wikimedia.org/r/1137764 (https://phabricator.wikimedia.org/T392366)
[15:01:55] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:02:09] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Revert "cloudcephosd2004-dev: put into service" [puppet] - https://gerrit.wikimedia.org/r/1137763 (https://phabricator.wikimedia.org/T392366) (owner: Andrew Bogott)
[15:02:41] <jinxer-wm>	 FIRING: [3x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[15:03:09] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye
[15:03:25] <wikibugs>	 ops-codfw, SRE, cloud-services-team, DC-Ops, Patch-For-Review: Service implementation for cloudcephosd2004-dev - https://phabricator.wikimedia.org/T392366#10757108 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd2004-dev.codfw.wmn...
[15:04:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[15:07:41] <jinxer-wm>	 FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[15:09:08] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=elastic2078\.codfw\.wmnet
[15:09:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[15:10:21] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2074 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:12:50] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[15:21:12] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2004-dev.codfw.wmnet with reason: host reimage
[15:23:56] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2004-dev.codfw.wmnet with reason: host reimage
[15:30:05] <jouncebot>	 jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T1530). nyaa~
[15:33:42] <wikibugs>	 (PS1) Bking: cirrussearch: update conftool with correct pool data [puppet] - https://gerrit.wikimedia.org/r/1137782 (https://phabricator.wikimedia.org/T388610)
[15:34:45] <wikibugs>	 (CR) Bking: "here's the list of cirrussearch hosts, note that we don't want to add 2079 as it has failed reimage" [puppet] - https://gerrit.wikimedia.org/r/1137782 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[15:35:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:35:33] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openstack: networktests: enable IPv6 tests on eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1137785 (https://phabricator.wikimedia.org/T391325)
[15:36:39] <wikibugs>	 (CR) Ebernhardson: [C:+1] "verified changed hosts are only available in dns (and pingable) with new names" [puppet] - https://gerrit.wikimedia.org/r/1137782 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[15:36:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:37:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2064 to cirrussearch2064
[15:38:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[15:38:31] <wikibugs>	 (CR) Bking: [C:+2] cirrussearch: update conftool with correct pool data [puppet] - https://gerrit.wikimedia.org/r/1137782 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[15:40:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:40:56] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: openstack: networktests: enable IPv6 tests on eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1137785 (https://phabricator.wikimedia.org/T391325)
[15:41:17] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2114 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:41:21] <wikibugs>	 (CR) CI reject: [V:-1] openstack: networktests: enable IPv6 tests on eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1137785 (https://phabricator.wikimedia.org/T391325) (owner: Arturo Borrero Gonzalez)
[15:42:04] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye
[15:42:13] <wikibugs>	 ops-codfw, SRE, cloud-services-team, DC-Ops, Patch-For-Review: Service implementation for cloudcephosd2004-dev - https://phabricator.wikimedia.org/T392366#10757237 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd2004-dev.codfw.wmnet w...
[15:42:41] <jinxer-wm>	 RESOLVED: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[15:42:49] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2115 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:42:52] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2064 to cirrussearch2064 - bking@cumin2002"
[15:43:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2064 to cirrussearch2064 - bking@cumin2002"
[15:43:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:43:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2064
[15:43:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2064
[15:43:59] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2064 to cirrussearch2064
[15:45:59] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2057
[15:46:01] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudgw: enable IPv6 [puppet] - https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174)
[15:46:12] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174) (owner: Arturo Borrero Gonzalez)
[15:47:18] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2045.codfw.wmnet with OS bookworm
[15:47:26] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10757256 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm executed with err...
[15:47:38] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2064.codfw.wmnet with OS bullseye
[15:47:51] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2064
[15:48:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[15:50:24] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cloudgw: enable IPv6 [puppet] - https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174)
[15:50:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:50:49] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: cloudgw: enable IPv6 [puppet] - https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174)
[15:52:11] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2057\.codfw\.wmnet
[15:53:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2064 - bking@cumin2002"
[15:53:27] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2064 - bking@cumin2002"
[15:53:28] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:53:28] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2064.codfw.wmnet 109.16.192.10.in-addr.arpa 9.0.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:53:32] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2064.codfw.wmnet 109.16.192.10.in-addr.arpa 9.0.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:53:32] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2064
[15:53:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:53:42] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2064
[15:53:42] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2064
[15:58:04] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: cloudgw: enable IPv6 [puppet] - https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174)
[15:58:22] <wikibugs>	 (PS5) Arturo Borrero Gonzalez: cloudgw: enable IPv6 [puppet] - https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174)
[15:58:29] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Revert "Revert "cloudcephosd2004-dev: put into service"" [puppet] - https://gerrit.wikimedia.org/r/1137764 (https://phabricator.wikimedia.org/T392366) (owner: Andrew Bogott)
[15:58:38] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174) (owner: Arturo Borrero Gonzalez)
[15:58:39] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2056\.codfw\.wmnet
[15:58:41] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2057\.codfw\.wmnet
[15:58:43] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2058\.codfw\.wmnet
[15:58:45] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2059\.codfw\.wmnet
[15:59:36] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2060\.codfw\.wmnet
[15:59:39] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2062\.codfw\.wmnet
[15:59:41] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2063\.codfw\.wmnet
[15:59:43] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2065\.codfw\.wmnet
[15:59:46] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2066\.codfw\.wmnet
[15:59:48] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2067\.codfw\.wmnet
[15:59:51] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2068\.codfw\.wmnet
[15:59:53] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2069\.codfw\.wmnet
[15:59:55] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2070\.codfw\.wmnet
[15:59:57] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2072\.codfw\.wmnet
[16:00:00] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2074\.codfw\.wmnet
[16:00:02] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2075\.codfw\.wmnet
[16:00:05] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2077\.codfw\.wmnet
[16:00:07] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2079\.codfw\.wmnet
[16:00:10] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2082\.codfw\.wmnet
[16:00:12] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2085\.codfw\.wmnet
[16:00:14] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2087\.codfw\.wmnet
[16:00:17] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2088\.codfw\.wmnet
[16:00:19] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2089\.codfw\.wmnet
[16:00:22] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2090\.codfw\.wmnet
[16:00:27] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2091\.codfw\.wmnet
[16:00:30] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2097\.codfw\.wmnet
[16:00:48] <inflatador>	 sorry for the spam, just realized it was logging each one
[16:00:59] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:03:44] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch2103.codfw.wmnet|cirrussearch2104.codfw.wmnet|cirrussearch2105.codfw.wmnet|cirrussearch2107.codfw.wmnet|cirrussearch2109.codfw.wmnet|cirrussearch2111.codfw.wmnet|cirrussearch2112.codfw.wmnet|cirrussearch2114.codfw.wmnet|cirrussearch2115.codfw.wmnet
[16:05:18] <wikibugs>	 (PS1) Andrew Bogott: cloudcephosd2004-dev: update nic names [puppet] - https://gerrit.wikimedia.org/r/1137798 (https://phabricator.wikimedia.org/T392366)
[16:05:54] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174) (owner: Arturo Borrero Gonzalez)
[16:06:16] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloudcephosd2004-dev: update nic names [puppet] - https://gerrit.wikimedia.org/r/1137798 (https://phabricator.wikimedia.org/T392366) (owner: Andrew Bogott)
[16:09:03] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2064.codfw.wmnet with reason: host reimage
[16:11:35] <logmsgbot>	 !log eevans@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1030.eqiad.wmnet with reason: Decommissioning — T378725
[16:11:36] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2089 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:11:38] <stashbot>	 T378725: Refresh aqs1013 w/ aqs1022 - https://phabricator.wikimedia.org/T378725
[16:12:25] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2064.codfw.wmnet with reason: host reimage
[16:13:31] <urandom>	 !log decommissioning Cassandra/restbase1030-{a,b,c} — T389423
[16:13:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:35] <stashbot>	 T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423
[16:14:42] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2068 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:20:26] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1204 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:21:44] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2059 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:22:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1155 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:23:45] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[16:25:26] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1204 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:28:14] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1191 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:31:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2064.codfw.wmnet with OS bullseye
[16:38:28] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1155 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:40:36] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2088 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:42:38] <icinga-wm>	 PROBLEM - Disk space on an-worker1139 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 158815 MB (4% inode=99%): /var/lib/hadoop/data/f 156640 MB (4% inode=99%): /var/lib/hadoop/data/j 158068 MB (4% inode=99%): /var/lib/hadoop/data/m 154727 MB (4% inode=99%): /var/lib/hadoop/data/h 157595 MB (4% inode=99%): /var/lib/hadoop/data/k 159019 MB (4% inode=99%): /var/lib/hadoop/data/e 159428 MB (4% inode=99%): /var/lib/hadoop/data
[16:42:38] <icinga-wm>	 3 MB (5% inode=99%): /var/lib/hadoop/data/b 154444 MB (4% inode=99%): /var/lib/hadoop/data/d 153665 MB (4% inode=99%): /var/lib/hadoop/data/i 154340 MB (4% inode=99%): /var/lib/hadoop/data/l 144922 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1139&var-datasource=eqiad+prometheus/ops
[16:49:14] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1191 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:50:00] <wikibugs>	 (PS1) Andrew Bogott: invisible-unicorn: Return 404 if caller tries to access a nonexistent proxy [puppet] - https://gerrit.wikimedia.org/r/1137803
[16:52:50] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2075 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:57:20] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2066 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T1700)
[17:00:05] <jouncebot>	 ryankemper: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T1700).
[17:03:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:12:44] <wikibugs>	 (CR) Cwhite: [C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1137272 (https://phabricator.wikimedia.org/T391392) (owner: Gehel)
[17:18:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2094 to cirrussearch2094
[17:18:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[17:19:04] <wikibugs>	 (CR) Dzahn: "does not affect hosts serving traffic - the compiler failure isn't real - it's a case of "only works with the change" https://puppet-compi" [puppet] - https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[17:20:07] <wikibugs>	 (CR) Dzahn: [C:+2] phabricator::migration: add scap::target, add deploy scripts, rm symlink [puppet] - https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[17:22:30] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2094 to cirrussearch2094 - bking@cumin2002"
[17:22:59] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2094 to cirrussearch2094 - bking@cumin2002"
[17:22:59] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:23:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2094
[17:23:18] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2094
[17:23:58] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2094 to cirrussearch2094
[17:24:34] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2094.codfw.wmnet with OS bullseye
[17:24:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2094
[17:24:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[17:24:53] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2050.codfw.wmnet with OS bookworm
[17:24:59] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10757424 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2050.codfw.wmnet with OS bookworm
[17:30:29] <logmsgbot>	 bking@cumin2002 reimage (PID 3444540) is awaiting input
[17:32:18] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[17:34:06] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2085 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:35:18] <icinga-wm>	 PROBLEM - Host msw1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[17:36:20] <icinga-wm>	 PROBLEM - Host ps1-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[17:36:20] <icinga-wm>	 PROBLEM - Host ps1-a6-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[17:36:20] <icinga-wm>	 PROBLEM - Host ps1-a7-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[17:36:24] <icinga-wm>	 RECOVERY - Host msw1-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.62 ms
[17:36:26] <icinga-wm>	 RECOVERY - Host ps1-a6-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.86 ms
[17:36:28] <icinga-wm>	 RECOVERY - Host ps1-a1-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.15 ms
[17:36:28] <icinga-wm>	 RECOVERY - Host ps1-a7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.03 ms
[17:36:29] <jinxer-wm>	 FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:38:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2050.codfw.wmnet with reason: host reimage
[17:41:45] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2050.codfw.wmnet with reason: host reimage
[17:42:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2094 - bking@cumin2002"
[17:42:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2094 - bking@cumin2002"
[17:42:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:42:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2094.codfw.wmnet 230.16.192.10.in-addr.arpa 0.3.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[17:42:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2094.codfw.wmnet 230.16.192.10.in-addr.arpa 0.3.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[17:42:41] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2094
[17:42:52] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2094
[17:42:53] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2094
[17:46:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[17:47:19] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[17:49:05] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1137272 (https://phabricator.wikimedia.org/T391392) (owner: 10Gehel)
[17:51:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[17:55:57] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1187 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:56:57] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1189 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:57:05] <icinga-wm>	 RECOVERY - ElasticSearch unassigned shard check - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:57:23] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[17:59:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2094.codfw.wmnet with reason: host reimage
[18:00:13] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2070 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:00:29] <logmsgbot>	 jhancock@cumin2002 reimage (PID 3444753) is awaiting input
[18:00:29] <icinga-wm>	 RECOVERY - ElasticSearch unassigned shard check - 9643 on search.svc.codfw.wmnet is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:02:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2094.codfw.wmnet with reason: host reimage
[18:04:56] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1189 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:05:40] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[18:05:41] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2050.codfw.wmnet with OS bookworm
[18:05:53] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10757481 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2050.codfw.wmnet with OS bookworm completed: - gane...
[18:10:30] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2067 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:18:00] <wikibugs>	 (CR) Majavah: [C:-1] invisible-unicorn: Return 404 if caller tries to access a nonexistent proxy (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1137803 (owner: Andrew Bogott)
[18:19:58] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10757498 (Jhancock.wm) figured it out. gonna finish the rest this evening :fingers-crossed:
[18:20:11] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10757501 (Jhancock.wm)
[18:22:01] <wikibugs>	 (PS2) Andrew Bogott: invisible-unicorn: Return 404 if caller tries to access a nonexistent proxy [puppet] - https://gerrit.wikimedia.org/r/1137803
[18:22:33] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2094.codfw.wmnet with OS bullseye
[18:22:56] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1187 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:26:40] <wikibugs>	 (PS4) DDesouza: Design Research Participant Survey: Deploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325)
[18:27:23] <wikibugs>	 (PS1) Jforrester: ZString: Don't explode if we're handed an array with odd contents [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1137813 (https://phabricator.wikimedia.org/T392370)
[18:32:41] <wikibugs>	 (PS5) DDesouza: Design Research Participant Survey: Deploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325)
[18:32:49] <wikibugs>	 (CR) CI reject: [V:-1] Design Research Participant Survey: Deploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza)
[18:35:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:35:46] <wikibugs>	 (PS6) DDesouza: Design Research Participant Survey: Deploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325)
[19:08:59] <wikibugs>	 (PS1) Jdrewniak: Create EventStream configuration for PES1.3 Wikirun Game [mediawiki-config] - https://gerrit.wikimedia.org/r/1137816
[19:23:42] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1194 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:33:07] <wikibugs>	 (CR) Bearloga: Create EventStream configuration for PES1.3 Wikirun Game (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1137816 (owner: Jdrewniak)
[19:39:44] <wikibugs>	 (PS2) Jdrewniak: Create EventStream configuration for PES1.3 Wikirun Game [mediawiki-config] - https://gerrit.wikimedia.org/r/1137816
[19:40:24] <wikibugs>	 (PS1) Bernard Wang: Enable reading list beta feature for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881)
[19:40:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:40:53] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) (owner: Bernard Wang)
[19:40:55] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) (owner: Bernard Wang)
[19:44:20] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2111 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[19:45:13] <wikibugs>	 (PS1) Dzahn: scap: stop hardcoding scap user home to fix puppet breakage [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889)
[19:46:01] <wikibugs>	 (PS2) Dzahn: scap: stop hardcoding scap user home to fix puppet breakage [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889)
[19:46:25] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137816 (owner: Jdrewniak)
[19:47:04] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2095 to cirrussearch2095
[19:47:26] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[19:48:16] <wikibugs>	 (CR) LorenMora: [C:+1] Enable reading list beta feature for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) (owner: Bernard Wang)
[19:48:50] <wikibugs>	 (CR) Dillon: [C:+1] Enable reading list beta feature for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) (owner: Bernard Wang)
[19:49:42] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1194 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:51:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2095 to cirrussearch2095 - bking@cumin2002"
[19:51:52] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2095 to cirrussearch2095 - bking@cumin2002"
[19:51:53] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:51:53] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2095
[19:52:03] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2095
[19:52:43] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2095 to cirrussearch2095
[19:52:44] <jinxer-wm>	 FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[19:53:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2095.codfw.wmnet with OS bullseye
[19:53:26] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2095
[19:53:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[19:53:41] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:56:42] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2109 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[19:57:44] <jinxer-wm>	 RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[19:58:10] <wikibugs>	 (PS4) Bking: cirrussearch: prepare for eqiad migration [puppet] - https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610)
[19:58:14] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[19:59:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2095 - bking@cumin2002"
[19:59:55] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2095 - bking@cumin2002"
[19:59:55] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:59:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2095.codfw.wmnet 232.16.192.10.in-addr.arpa 2.3.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[19:59:59] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2095.codfw.wmnet 232.16.192.10.in-addr.arpa 2.3.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[20:00:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2095
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T2000).
[20:00:04] <jouncebot>	 danisztls, bwang, and jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2095
[20:00:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2095
[20:01:08] <danisztls>	 o/
[20:01:52] <jan_drewniak>	 o/
[20:02:05] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 113, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:02:21] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:03:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[20:04:19] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2090 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:06:17] <bwang>	 Hello! Here for the window
[20:06:32] <jan_drewniak>	 hey danisztls , bwang , looks like it's just config changes, I can do the deploy today :) 
[20:08:05] <jan_drewniak>	 I'm going to do all three at once, I don't think there's a dependency between any of them. 
[20:09:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2110 to cirrussearch2110
[20:09:40] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdrewniak@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza)
[20:09:40] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdrewniak@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) (owner: Bernard Wang)
[20:09:41] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdrewniak@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137816 (owner: Jdrewniak)
[20:10:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[20:10:13] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2063 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:10:32] <wikibugs>	 (Merged) jenkins-bot: Design Research Participant Survey: Deploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza)
[20:10:33] <jan_drewniak>	 bwang: is the patch on the schedule just duplicated, or should there be a different patch there too? https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T2000
[20:10:35] <wikibugs>	 (Merged) jenkins-bot: Enable reading list beta feature for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) (owner: Bernard Wang)
[20:10:42] <wikibugs>	 (Merged) jenkins-bot: Create EventStream configuration for PES1.3 Wikirun Game [mediawiki-config] - https://gerrit.wikimedia.org/r/1137816 (owner: Jdrewniak)
[20:10:57] <logmsgbot>	 !log jdrewniak@deploy1003 Started scap sync-world: Backport for [[gerrit:1137568|Design Research Participant Survey: Deploy (T392325)]], [[gerrit:1137817|Enable reading list beta feature for beta cluster (T390881)]], [[gerrit:1137816|Create EventStream configuration for PES1.3 Wikirun Game]]
[20:11:02] <stashbot>	 T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325
[20:11:02] <stashbot>	 T390881: Enable extension:ReadingList as beta feature on beta cluster - https://phabricator.wikimedia.org/T390881
[20:12:52] <wikibugs>	 (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1137818/5323/" [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[20:14:12] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2110 to cirrussearch2110 - bking@cumin2002"
[20:14:57] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2110 to cirrussearch2110 - bking@cumin2002"
[20:14:57] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:14:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2110
[20:15:23] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2110
[20:15:43] <logmsgbot>	 !log jdrewniak@deploy1003 jdrewniak, dani, bwang: Backport for [[gerrit:1137568|Design Research Participant Survey: Deploy (T392325)]], [[gerrit:1137817|Enable reading list beta feature for beta cluster (T390881)]], [[gerrit:1137816|Create EventStream configuration for PES1.3 Wikirun Game]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:16:04] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2110 to cirrussearch2110
[20:16:30] <jan_drewniak>	 danisztls, bwang changes are ready to test on mwdebug
[20:17:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2095.codfw.wmnet with reason: host reimage
[20:18:12] <danisztls>	 jan_drewniak: looks good
[20:18:17] <bwang>	 Ok checking
[20:20:03] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2110.codfw.wmnet with OS bullseye
[20:20:11] <wikibugs>	 ops-eqiad, SRE, DC-Ops: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10757776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cirrussearch2110.codfw.w...
[20:20:20] <bwang>	 I think I'm having issues with my wikimediadebug extension...
[20:20:32] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2095.codfw.wmnet with reason: host reimage
[20:22:59] <logmsgbot>	 !log jdrewniak@deploy1003 jdrewniak, dani, bwang: Continuing with sync
[20:23:40] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[20:25:01] <wikibugs>	 (PS5) Bking: cirrussearch: prepare for eqiad migration [puppet] - https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610)
[20:26:56] <wikibugs>	 (PS6) Bking: cirrussearch: prepare for eqiad migration [puppet] - https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610)
[20:27:07] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[20:29:41] <logmsgbot>	 !log jdrewniak@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137568|Design Research Participant Survey: Deploy (T392325)]], [[gerrit:1137817|Enable reading list beta feature for beta cluster (T390881)]], [[gerrit:1137816|Create EventStream configuration for PES1.3 Wikirun Game]] (duration: 18m 44s)
[20:29:46] <stashbot>	 T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325
[20:29:47] <stashbot>	 T390881: Enable extension:ReadingList as beta feature on beta cluster - https://phabricator.wikimedia.org/T390881
[20:31:46] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2082 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:32:55] <jan_drewniak>	 Ok backport sync done 
[20:33:26] <danisztls>	 jan_drewniak: thanks!
[20:36:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2110.codfw.wmnet with reason: host reimage
[20:38:53] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2110.codfw.wmnet with reason: host reimage
[20:40:28] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2095.codfw.wmnet with OS bullseye
[20:51:37] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2105 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:59:28] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2110.codfw.wmnet with OS bullseye
[20:59:33] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2058 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:59:34] <wikibugs>	 ops-eqiad, SRE, DC-Ops: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10757835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cirrussearch2110.codfw.wmnet...
[21:00:04] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T2100). Please do the needful.
[21:03:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2096 to cirrussearch2096
[21:03:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:03:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[21:08:05] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2096 to cirrussearch2096 - bking@cumin2002"
[21:11:10] <logmsgbot>	 bking@cumin2002 rename (PID 3675984) is awaiting input
[21:11:11] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2077 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:14:32] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2096 to cirrussearch2096 - bking@cumin2002"
[21:14:32] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:14:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2096
[21:14:42] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2096
[21:15:22] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2096 to cirrussearch2096
[21:23:10] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2071 to cirrussearch2071
[21:23:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[21:27:48] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2071 to cirrussearch2071 - bking@cumin2002"
[21:28:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2071 to cirrussearch2071 - bking@cumin2002"
[21:28:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:28:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2071
[21:28:55] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2071
[21:29:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2071 to cirrussearch2071
[21:31:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2071.codfw.wmnet with OS bullseye
[21:31:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2071
[21:31:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[21:33:19] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2056 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:34:20] <wikibugs>	 (PS1) Ryan Kemper: rolling-operation: (proof of concept) manually output commands [cookbooks] - https://gerrit.wikimedia.org/r/1137824
[21:35:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2071 - bking@cumin2002"
[21:35:54] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2071 - bking@cumin2002"
[21:35:55] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:35:55] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2071.codfw.wmnet 70.32.192.10.in-addr.arpa 0.7.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[21:35:59] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2071.codfw.wmnet 70.32.192.10.in-addr.arpa 0.7.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[21:35:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2071
[21:36:20] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2071
[21:36:20] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2071
[21:36:29] <jinxer-wm>	 FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:37:09] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[21:37:38] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (6 nodes at a time) for ElasticSearch cluster search_codfw: test manual mode - ryankemper@cumin2002 - T388610
[21:37:42] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[21:37:43] <logmsgbot>	 !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (6 nodes at a time) for ElasticSearch cluster search_codfw: test manual mode - ryankemper@cumin2002 - T388610
[21:38:52] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (6 nodes at a time) for ElasticSearch cluster search_codfw: test manual mode - ryankemper@cumin2002 - T388610
[21:41:40] <wikibugs>	 (CR) CI reject: [V:-1] rolling-operation: (proof of concept) manually output commands [cookbooks] - https://gerrit.wikimedia.org/r/1137824 (owner: Ryan Kemper)
[21:44:21] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[21:46:40] <wikibugs>	 (CR) Dzahn: "This change (or one of the other 2 that were merged at the same time a little while ago today) seems to have broken beta-scap-sync-world. " [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) (owner: Bernard Wang)
[21:46:44] <wikibugs>	 (CR) Dzahn: "This change (or one of the other 2 that were merged at the same time a little while ago today) seems to have broken beta-scap-sync-world. " [mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza)
[21:46:49] <wikibugs>	 (CR) Dzahn: "This change (or one of the other 2 that were merged at the same time a little while ago today) seems to have broken beta-scap-sync-world. " [mediawiki-config] - https://gerrit.wikimedia.org/r/1137816 (owner: Jdrewniak)
[21:51:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2071.codfw.wmnet with reason: host reimage
[21:55:10] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2071.codfw.wmnet with reason: host reimage
[21:55:55] <wikibugs>	 (CR) Dzahn: "in the scap::target class, the relevant line is "home => "/var/lib/${deploy_user}". This is the case if $manage_user is set to true." [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[21:57:12] <wikibugs>	 (PS3) Dzahn: scap: stop hardcoding scap user home to fix puppet breakage [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889)
[21:59:03] <wikibugs>	 (CR) CI reject: [V:-1] scap: stop hardcoding scap user home to fix puppet breakage [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[22:03:53] <wikibugs>	 (PS4) Dzahn: scap: stop hardcoding scap user home to fix puppet breakage [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889)
[22:05:45] <wikibugs>	 (CR) CI reject: [V:-1] scap: stop hardcoding scap user home to fix puppet breakage [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[22:06:52] <wikibugs>	 (PS5) Dzahn: scap: stop hardcoding scap user home to fix puppet breakage [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889)
[22:08:28] <wikibugs>	 (CR) Dzahn: "puppet compiler run linked above ran on "C:scap" so it picked one host for each regex in site.pp with roles that include scap and showed n" [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[22:11:25] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2103 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:12:47] <wikibugs>	 (PS1) Dzahn: phabricator/scap: disable scap bootstrapping on phab1005 [puppet] - https://gerrit.wikimedia.org/r/1137827 (https://phabricator.wikimedia.org/T377889)
[22:15:31] <wikibugs>	 (CR) DDesouza: "Sorry about that though I don't think this config change is the issue. The code is similar to previous deployments." [mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza)
[22:20:00] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2071.codfw.wmnet with OS bullseye
[22:20:19] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2079 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:22:37] <icinga-wm>	 RECOVERY - Disk space on an-worker1139 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1139&var-datasource=eqiad+prometheus/ops
[22:30:22] <wikibugs>	 (CR) Dzahn: "ACK! I just said that because it seemed they were updated in the same minute when I looked at the repo. Sorry as well for the noise. It's " [mediawiki-config] - https://gerrit.wikimedia.org/r/1137568 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza)
[22:35:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:35:36] <wikibugs>	 (CR) Dzahn: [C:+2] "just to not leave puppet broken on a host in setup" [puppet] - https://gerrit.wikimedia.org/r/1137827 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[22:40:27] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2055 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:42:04] <mutante>	 that alert on crm2001 is both known and weird. known because it's https://phabricator.wikimedia.org/T383715 and WIP but also weird because it links to alerts.wikimedia.org where it does not show up.. it's silenced yet still talks on IRC.. 
[22:44:56] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (6 nodes at a time) for ElasticSearch cluster search_codfw: test manual mode - ryankemper@cumin2002 - T388610
[22:45:00] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[22:45:52] <wikibugs>	 (CR) Dzahn: [C:+2] "removed errors about bootstrapping but we still have "Package[phabricator/deployment]: Provider scap3 is not functional on this host"" [puppet] - https://gerrit.wikimedia.org/r/1137827 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[22:47:54] <wikibugs>	 (PS1) Dzahn: phabricator: comment out scap::target in migration class [puppet] - https://gerrit.wikimedia.org/r/1137830 (https://phabricator.wikimedia.org/T377889)
[22:49:05] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2060 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:52:29] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/output/1137830/5326/phab1005.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1137830 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[22:52:30] <wikibugs>	 (CR) Dzahn: [C:+2] phabricator: comment out scap::target in migration class [puppet] - https://gerrit.wikimedia.org/r/1137830 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[23:00:04] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T2300)
[23:22:31] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2087 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:39:51] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1137835
[23:39:51] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1137835 (owner: TrainBranchBot)
[23:40:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:49:26] <Reedy>	 jouncebot: nowandnext
[23:49:27] <jouncebot>	 For the next 0 hour(s) and 10 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250421T2300)
[23:49:27] <jouncebot>	 In 2 hour(s) and 10 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250422T0200)
[23:50:40] <wikibugs>	 (PS1) Reedy: InitialiseSettings-labs.php: Fix ReadingList config [mediawiki-config] - https://gerrit.wikimedia.org/r/1137837
[23:51:16] <wikibugs>	 (CR) Reedy: Enable reading list beta feature for beta cluster (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1137817 (https://phabricator.wikimedia.org/T390881) (owner: Bernard Wang)
[23:51:34] <wikibugs>	 (CR) Reedy: [C:+2] InitialiseSettings-labs.php: Fix ReadingList config [mediawiki-config] - https://gerrit.wikimedia.org/r/1137837 (owner: Reedy)
[23:52:01] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9400 on cirrussearch2112 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:52:10] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1137835 (owner: TrainBranchBot)
[23:52:21] <wikibugs>	 (Merged) jenkins-bot: InitialiseSettings-labs.php: Fix ReadingList config [mediawiki-config] - https://gerrit.wikimedia.org/r/1137837 (owner: Reedy)
[23:53:41] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:53:53] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2069 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:58:11] <icinga-wm>	 RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2072 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration