[00:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:01:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P75952 and previous config saved to /var/cache/conftool/dbconfig/20250513-000157-fceratto.json [00:02:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1070-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [00:02:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:21] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1144687 [00:08:21] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1144687 (owner: 10TrainBranchBot) [00:17:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T392806)', diff saved to https://phabricator.wikimedia.org/P75953 and previous config saved to /var/cache/conftool/dbconfig/20250513-001704-fceratto.json [00:17:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2214.codfw.wmnet with reason: Maintenance [00:17:39] !log fceratto@cumin1002 dbctl commit 
(dc=all): 'Depooling db2214 (T392806)', diff saved to https://phabricator.wikimedia.org/P75954 and previous config saved to /var/cache/conftool/dbconfig/20250513-001736-fceratto.json [00:22:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [00:24:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T392806)', diff saved to https://phabricator.wikimedia.org/P75955 and previous config saved to /var/cache/conftool/dbconfig/20250513-002436-fceratto.json [00:27:30] (03PS1) 10Ssingh: Revert "Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1144690 [00:28:25] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1144687 (owner: 10TrainBranchBot) [00:29:02] (03CR) 10Ssingh: [C:03+2] Revert "Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1144690 (owner: 10Ssingh) [00:30:32] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_codfw [00:31:56] !log run agent on A:lvs-eqiad to re-enable puppet: T393911 [00:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:59] T393911: Figure out why OpenSearch operational scripts frequently fail to connect - https://phabricator.wikimedia.org/T393911 [00:32:51] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_codfw [00:39:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P75956 and previous config saved to /var/cache/conftool/dbconfig/20250513-003944-fceratto.json [00:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power 
autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [00:46:56] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:54:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P75957 and previous config saved to /var/cache/conftool/dbconfig/20250513-005451-fceratto.json [01:00:24] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7fc7d0cc4ed0: Failed to establish a new connection: [Errno 113 [01:00:24] te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration [01:01:24] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 34, number_of_data_nodes: 34, discovered_master: True, active_primary_shards: 1708, active_shards: 5112, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 11, delayed_unassigned_shards [01:01:24] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.78528206129221 
https://wikitech.wikimedia.org/wiki/Search%23Administration [01:05:56] (03PS2) 10Scott French: P:mw:maint:update_flaggedrev_stats: migrate to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1144692 (https://phabricator.wikimedia.org/T388535) [01:07:47] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.1 [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1144694 (https://phabricator.wikimedia.org/T392171) [01:07:48] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.1 [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1144694 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot) [01:09:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T392806)', diff saved to https://phabricator.wikimedia.org/P75958 and previous config saved to /var/cache/conftool/dbconfig/20250513-010959-fceratto.json [01:10:19] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2217.codfw.wmnet with reason: Maintenance [01:10:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T392806)', diff saved to https://phabricator.wikimedia.org/P75959 and previous config saved to /var/cache/conftool/dbconfig/20250513-011026-fceratto.json [01:18:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T392806)', diff saved to https://phabricator.wikimedia.org/P75960 and previous config saved to /var/cache/conftool/dbconfig/20250513-011827-fceratto.json [01:48:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P75962 and previous config saved to /var/cache/conftool/dbconfig/20250513-014841-fceratto.json [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T0200) [02:03:49] 
!log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T392806)', diff saved to https://phabricator.wikimedia.org/P75963 and previous config saved to /var/cache/conftool/dbconfig/20250513-020349-fceratto.json [02:04:09] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2224.codfw.wmnet with reason: Maintenance [02:04:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T392806)', diff saved to https://phabricator.wikimedia.org/P75964 and previous config saved to /var/cache/conftool/dbconfig/20250513-020415-fceratto.json [02:11:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T392806)', diff saved to https://phabricator.wikimedia.org/P75965 and previous config saved to /var/cache/conftool/dbconfig/20250513-021112-fceratto.json [02:26:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P75966 and previous config saved to /var/cache/conftool/dbconfig/20250513-022619-fceratto.json [02:41:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P75967 and previous config saved to /var/cache/conftool/dbconfig/20250513-024127-fceratto.json [02:54:14] (03PS1) 10Bartosz Dziewoński: Update for Parsoid's rename of XMLSerializer to XHtmlSerializer [extensions/DiscussionTools] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1144715 (https://phabricator.wikimedia.org/T393983) [02:55:49] (03CR) 10RLazarus: [C:03+2] scap: Loud deprecation warning for mwscript, now officially unsupported [puppet] - 10https://gerrit.wikimedia.org/r/1144668 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [02:56:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T392806)', diff saved to https://phabricator.wikimedia.org/P75968 and previous 
config saved to /var/cache/conftool/dbconfig/20250513-025634-fceratto.json [02:58:26] FIRING: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:00:05] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T0300) [03:03:26] RESOLVED: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:21:02] PROBLEM - Hadoop NodeManager on analytics1070 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:21:42] PROBLEM - Hadoop NodeManager on an-worker1171 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:21:58] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1068 is CRITICAL: CRITICAL - enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z), cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [03:22:02] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch1068 is CRITICAL: CRITICAL - skwiktionary_archive_1717364159[0](2025-05-09T20:16:16.114Z), 
azwikiquote_content_1727944408[0](2025-05-09T20:16:25.362Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [03:40:14] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be2004 - https://phabricator.wikimedia.org/T392845#10814673 (10Jhancock.wm) a:03Jhancock.wm [03:42:42] RECOVERY - Hadoop NodeManager on an-worker1171 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:48:02] RECOVERY - Hadoop NodeManager on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:57:50] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10814682 (10Jhancock.wm) a:03Jhancock.wm [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T0400) [04:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:02:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:04:25] !log mwpresync@deploy1003 Pruned MediaWiki: 1.44.0-wmf.25 (duration: 04m 17s) [04:05:18] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team 
(Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10814692 (10Jhancock.wm) a:03Jhancock.wm [04:15:41] 06SRE, 06MediaWiki-Platform-Team: I am blocked from accessing the beta cluster - https://phabricator.wikimedia.org/T393985 (10matmarex) 03NEW [04:16:18] 10ops-codfw, 06DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986 (10Jhancock.wm) 03NEW [04:19:00] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1069 is CRITICAL: CRITICAL - cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z), enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [04:22:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:22:53] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10814733 (10Jhancock.wm) a:03Jhancock.wm [04:27:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10814740 (10Jhancock.wm) a:03Jhancock.wm [04:33:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10814745 (10Jhancock.wm) a:05Jhancock.wm→03None [04:36:18] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10814752 (10Jhancock.wm) [04:38:42] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install es204[78] - https://phabricator.wikimedia.org/T393106#10814770 (10Jhancock.wm) a:03Jhancock.wm [04:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - 
https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [04:46:56] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:53:13] (03PS1) 10Marostegui: installserver: Add db2244 to preseed [puppet] - 10https://gerrit.wikimedia.org/r/1144736 (https://phabricator.wikimedia.org/T393195) [04:58:15] (03CR) 10Marostegui: [C:03+2] installserver: Add db2244 to preseed [puppet] - 10https://gerrit.wikimedia.org/r/1144736 (https://phabricator.wikimedia.org/T393195) (owner: 10Marostegui) [05:02:26] (03PS1) 10Marostegui: mariadb: Add db2244 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1144737 (https://phabricator.wikimedia.org/T393195) [05:03:43] (03CR) 10Marostegui: [C:03+2] mariadb: Add db2244 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1144737 (https://phabricator.wikimedia.org/T393195) (owner: 10Marostegui) [05:05:57] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10814784 (10Marostegui) Patches done [05:06:06] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10814785 (10Marostegui) [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:14:08] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy 
https://wikitech.wikimedia.org/wiki/Etcd [05:15:08] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [05:16:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1031 es2029 T391921', diff saved to https://phabricator.wikimedia.org/P75969 and previous config saved to /var/cache/conftool/dbconfig/20250513-051617-marostegui.json [05:16:21] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921 [05:16:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2029.codfw.wmnet with reason: Maintenance [05:17:54] (03PS1) 10Marostegui: es1031: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1144739 (https://phabricator.wikimedia.org/T391921) [05:19:58] PROBLEM - Host es1031 #page is DOWN: PING CRITICAL - Packet loss = 100% [05:21:50] RECOVERY - Host es1031 #page is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [05:22:02] PROBLEM - mysqld processes #page on es1031 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [05:22:04] I downtimed it [05:22:26] PROBLEM - MariaDB read only es3 on es1031 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:22:38] !incidents [05:22:38] 6118 (ACKED) Host es1031 (paged) - PING - Packet loss = 100% [05:22:38] 6119 (ACKED) es1031 (paged)/mysqld processes (paged) [05:22:38] 6117 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [05:22:39] 6115 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [05:22:39] 6114 (RESOLVED) [2x] ProbeDown sre (upload-https:443 probes/service eqsin) [05:22:58] I will do it again [05:23:06] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 
es1031.eqiad.wmnet with reason: Maintenance [05:24:21] <_joe_> wth? [05:24:34] (03CR) 10Marostegui: [C:03+2] es1031: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1144739 (https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui) [05:24:47] <_joe_> (the etcd cluster health alert, not the es one) [05:27:03] (03PS1) 10Marostegui: es2029: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1144740 (https://phabricator.wikimedia.org/T391921) [05:28:02] RECOVERY - mysqld processes #page on es1031 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [05:28:25] (03CR) 10Marostegui: [C:03+2] es2029: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1144740 (https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui) [05:28:28] RECOVERY - MariaDB read only es3 on es1031 is OK: Version 10.11.11-MariaDB-log, Uptime 57s, read_only: True, event_scheduler: True, 9.23 QPS, connection latency: 0.024609s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:29:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1031 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75970 and previous config saved to /var/cache/conftool/dbconfig/20250513-052913-root.json [05:31:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75971 and previous config saved to /var/cache/conftool/dbconfig/20250513-053102-root.json [05:44:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1031 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75972 and previous config saved to /var/cache/conftool/dbconfig/20250513-054418-root.json [05:46:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75973 and 
previous config saved to /var/cache/conftool/dbconfig/20250513-054607-root.json [05:48:41] (03PS1) 10Muehlenhoff: Update debt entry for LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/1144745 [05:49:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [05:50:00] (03CR) 10CI reject: [V:04-1] Update debt entry for LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/1144745 (owner: 10Muehlenhoff) [05:50:58] (03PS2) 10Muehlenhoff: Update debt entry for LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/1144745 [05:52:17] (03CR) 10CI reject: [V:04-1] Update debt entry for LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/1144745 (owner: 10Muehlenhoff) [05:53:36] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1070 is CRITICAL: CRITICAL - enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z), cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [05:53:36] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch1070 is CRITICAL: CRITICAL - azwikiquote_content_1727944408[0](2025-05-09T20:16:25.362Z), skwiktionary_archive_1717364159[0](2025-05-09T20:16:16.114Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [05:54:00] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1144649 (https://phabricator.wikimedia.org/T393595) (owner: 10BCornwall) [05:56:25] (03PS3) 10Muehlenhoff: Remove LDAP tracking access for debt [puppet] - 10https://gerrit.wikimedia.org/r/1144745 [05:59:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [05:59:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1031 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75974 and previous config saved to /var/cache/conftool/dbconfig/20250513-055924-root.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T0600) [06:00:05] marostegui, Amir1, and federico3: Time to snap out of that daydream and deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T0600). [06:01:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75975 and previous config saved to /var/cache/conftool/dbconfig/20250513-060113-root.json [06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:01:59] (03CR) 10Alexandros Kosiaris: [C:03+2] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1144668 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [06:02:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [06:09:37] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10814834 (10Marostegui) [06:10:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db1258 - 
https://phabricator.wikimedia.org/T392493#10814837 (10Marostegui) a:05Marostegui→03None [06:10:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10814839 (10Marostegui) [06:10:25] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10814841 (10Marostegui) a:05Marostegui→03None [06:14:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1031 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75976 and previous config saved to /var/cache/conftool/dbconfig/20250513-061430-root.json [06:16:14] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch1113 is CRITICAL: CRITICAL - skwiktionary_archive_1717364159[0](2025-05-09T20:16:16.114Z), azwikiquote_content_1727944408[0](2025-05-09T20:16:25.362Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:16:16] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1116 is CRITICAL: CRITICAL - cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z), enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:16:16] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch1120 is CRITICAL: CRITICAL - skwiktionary_archive_1717364159[0](2025-05-09T20:16:16.114Z), azwikiquote_content_1727944408[0](2025-05-09T20:16:25.362Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:16:16] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1123 is CRITICAL: CRITICAL - cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z), enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:16:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75977 and previous config saved to 
/var/cache/conftool/dbconfig/20250513-061618-root.json [06:16:24] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7f0b59ac4ed0: Failed to establish a new connection: [Errno 113 [06:16:24] te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:17:24] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 34, number_of_data_nodes: 34, active_primary_shards: 1708, active_shards: 5112, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 11, delayed_unassigned_shards: 0, number_of_pending_ta [06:17:24] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.78528206129221 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:20:32] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1112 is CRITICAL: CRITICAL - enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z), cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:20:34] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1119 is CRITICAL: CRITICAL - cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z), enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:22:24] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while 
fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7f5255e7ced0: Failed to establish a new connection: [Errno 113 [06:22:24] te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:23:24] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 34, number_of_data_nodes: 34, discovered_master: True, active_primary_shards: 1708, active_shards: 5112, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 11, delayed_unassigned_shards [06:23:24] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.78528206129221 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:24:48] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch1112 is CRITICAL: CRITICAL - skwiktionary_archive_1717364159[0](2025-05-09T20:16:16.114Z), azwikiquote_content_1727944408[0](2025-05-09T20:16:25.362Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:24:50] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1115 is CRITICAL: CRITICAL - cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z), enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:24:50] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1122 is CRITICAL: CRITICAL - enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z), cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:24:50] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch1119 is CRITICAL: CRITICAL - 
azwikiquote_content_1727944408[0](2025-05-09T20:16:25.362Z), skwiktionary_archive_1717364159[0](2025-05-09T20:16:16.114Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:24:57] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10814847 (10MoritzMuehlenhoff) >>! In T391083#10810288, @Volans wrote: > So the error for the debmonitor client is due by the fact that in `/etc/os-r... [06:29:06] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1111 is CRITICAL: CRITICAL - enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z), cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:29:08] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1118 is CRITICAL: CRITICAL - cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z), enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:29:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1031 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75978 and previous config saved to /var/cache/conftool/dbconfig/20250513-062935-root.json [06:31:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75979 and previous config saved to /var/cache/conftool/dbconfig/20250513-063123-root.json [06:32:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [06:32:12] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - 
enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z), cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:33:24] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1114 is CRITICAL: CRITICAL - cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z), enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:33:24] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1121 is CRITICAL: CRITICAL - cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z), enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:34:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [06:36:28] PROBLEM - ElasticSearch unassigned shard check - 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - azwikiquote_content_1727944408[0](2025-05-09T20:16:25.362Z), skwiktionary_archive_1717364159[0](2025-05-09T20:16:16.114Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:37:40] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1117 is CRITICAL: CRITICAL - cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z), enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:37:40] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch1114 is CRITICAL: CRITICAL - azwikiquote_content_1727944408[0](2025-05-09T20:16:25.362Z), skwiktionary_archive_1717364159[0](2025-05-09T20:16:16.114Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:37:41] (03PS1) 10Alexandros Kosiaris: function-orchestrator: Bump CPU 
requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144920 (https://phabricator.wikimedia.org/T389375) [06:38:14] (03Abandoned) 10Alexandros Kosiaris: [DNM]: Add mw-wikifunctions-ro to deployment server listeners [puppet] - 10https://gerrit.wikimedia.org/r/1144577 (https://phabricator.wikimedia.org/T389375) (owner: 10Alexandros Kosiaris) [06:38:34] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1125 is CRITICAL: CRITICAL - enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z), cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:38:35] (03CR) 10CI reject: [V:04-1] function-orchestrator: Bump CPU requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144920 (https://phabricator.wikimedia.org/T389375) (owner: 10Alexandros Kosiaris) [06:38:38] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch1125 is CRITICAL: CRITICAL - azwikiquote_content_1727944408[0](2025-05-09T20:16:25.362Z), skwiktionary_archive_1717364159[0](2025-05-09T20:16:16.114Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:39:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [06:39:27] (03CR) 10Bartosz Wójtowicz: ml-inference-services: edit-check experirmental prod deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [06:41:58] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1113 is CRITICAL: CRITICAL - cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z), enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z) https://wikitech.wikimedia.org/wiki/Search%23Administration 
[06:41:58] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1120 is CRITICAL: CRITICAL - enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z), cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [06:44:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1031 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75980 and previous config saved to /var/cache/conftool/dbconfig/20250513-064440-root.json [06:46:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75981 and previous config saved to /var/cache/conftool/dbconfig/20250513-064629-root.json [06:50:36] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [06:52:36] (03PS1) 10Slyngshede: SSH Key: Add comment [software/bitu] - 10https://gerrit.wikimedia.org/r/1145031 [06:53:09] (03CR) 10Slyngshede: [C:03+1] Remove LDAP tracking access for debt [puppet] - 10https://gerrit.wikimedia.org/r/1144745 (owner: 10Muehlenhoff) [06:55:36] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [06:56:26] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:idm enable django-vite [puppet] - 10https://gerrit.wikimedia.org/r/1144550 (https://phabricator.wikimedia.org/T391443) (owner: 10Slyngshede) [06:58:28] (03CR) 10Muehlenhoff: [C:03+2] Remove LDAP tracking access for debt [puppet] - 10https://gerrit.wikimedia.org/r/1144745 (owner: 10Muehlenhoff) 
[06:59:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1031 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75982 and previous config saved to /var/cache/conftool/dbconfig/20250513-065946-root.json [07:00:05] Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:01:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75983 and previous config saved to /var/cache/conftool/dbconfig/20250513-070135-root.json [07:05:43] (03CR) 10Slyngshede: [C:03+2] SSH Key: Add comment [software/bitu] - 10https://gerrit.wikimedia.org/r/1145031 (owner: 10Slyngshede) [07:05:49] (03PS2) 10Filippo Giunchedi: zuul: disable statsd_exporter relaying to graphite [puppet] - 10https://gerrit.wikimedia.org/r/1144553 (https://phabricator.wikimedia.org/T228380) [07:05:49] (03PS2) 10Filippo Giunchedi: airflow: disable statsd_exporter relaying to graphite [puppet] - 10https://gerrit.wikimedia.org/r/1144554 (https://phabricator.wikimedia.org/T228380) [07:05:49] (03PS2) 10Filippo Giunchedi: graphite: remove access to port 2003 tcp/udp [puppet] - 10https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) [07:06:07] (03CR) 10Filippo Giunchedi: "I've expanded on the commit message too, HTH" [puppet] - 10https://gerrit.wikimedia.org/r/1144553 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [07:06:14] (03CR) 10Filippo Giunchedi: airflow: disable statsd_exporter relaying to graphite (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1144554 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [07:07:39] (03CR) 10Filippo Giunchedi: [C:03+2] thanos: move to native trace sampling 
0.1% [puppet] - 10https://gerrit.wikimedia.org/r/1140135 (https://phabricator.wikimedia.org/T392994) (owner: 10Filippo Giunchedi) [07:08:24] (03Merged) 10jenkins-bot: SSH Key: Add comment [software/bitu] - 10https://gerrit.wikimedia.org/r/1145031 (owner: 10Slyngshede) [07:09:40] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr2-drmrs) - https://phabricator.wikimedia.org/T393991 (10LSobanski) 03NEW [07:11:31] (03CR) 10Filippo Giunchedi: [C:03+1] "Very nice! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1144650 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [07:11:41] 06SRE, 06Infrastructure-Foundations, 10netops: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#10814904 (10ayounsi) a:05ayounsi→03Papaul Re-assigning it to Papaul to do the change on `ulsfo` and `eqsin`. It is a good training opportunity, and would remove moving p... [07:14:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1031 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75984 and previous config saved to /var/cache/conftool/dbconfig/20250513-071451-root.json [07:16:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75985 and previous config saved to /var/cache/conftool/dbconfig/20250513-071639-root.json [07:17:04] 06SRE, 10SRE-swift-storage, 10Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10814924 (10MatthewVernon) [07:21:18] (03CR) 10DCausse: [C:03+2] wdqs: check max lag on wdqs-main and wdqs-sholarly [alerts] - 10https://gerrit.wikimedia.org/r/1144474 (owner: 10DCausse) [07:23:07] (03Merged) 10jenkins-bot: wdqs: check max lag on wdqs-main and wdqs-sholarly [alerts] - 10https://gerrit.wikimedia.org/r/1144474 (owner: 10DCausse) [07:24:13] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: 
Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10814931 (10MatthewVernon) @VRiley-WMF apologies, but any chance you can get thanos-fe1007 to at least PXE-boot OK, please? I don't think I can useful... [07:26:09] (03PS1) 10Muehlenhoff: Add mysql grants for cumin1003 [puppet] - 10https://gerrit.wikimedia.org/r/1145043 (https://phabricator.wikimedia.org/T393990) [07:29:08] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:29:16] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:29:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-codfw and Arelion (2001:2035:0:af4::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [07:29:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1031 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75986 and previous config saved to /var/cache/conftool/dbconfig/20250513-072956-root.json [07:30:44] (03CR) 10Ayounsi: [C:03+2] Interface: add validator for child + non-virtual [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144593 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [07:30:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:31:30] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras 
rolling restart_daemons on A:netbox-canary [07:31:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75987 and previous config saved to /var/cache/conftool/dbconfig/20250513-073145-root.json [07:31:46] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [07:32:26] (03CR) 10Brouberol: [C:03+1] Set the remaining Enterprise WM Downloader job to absent [puppet] - 10https://gerrit.wikimedia.org/r/1143134 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [07:32:30] (03CR) 10Brouberol: [C:03+2] Set the remaining Enterprise WM Downloader job to absent [puppet] - 10https://gerrit.wikimedia.org/r/1143134 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [07:33:14] (03Merged) 10jenkins-bot: Interface: add validator for child + non-virtual [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144593 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [07:33:54] PROBLEM - Juniper alarms on cr1-codfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.192 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [07:34:10] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:34:47] (03PS1) 10Muehlenhoff: Add cumin1003 as mysql root client [puppet] - 10https://gerrit.wikimedia.org/r/1145085 (https://phabricator.wikimedia.org/T393990) [07:34:53] (03PS6) 10Brouberol: Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [07:35:02] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [07:35:36] !log ayounsi@cumin1002 END (PASS) - Cookbook 
sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [07:36:44] RECOVERY - Juniper alarms on cr1-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [07:37:14] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 130, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:38:02] 06SRE, 10Wikimedia-Mailing-lists: Postorius (held and) reported full headers get mangled somewhere in the system - https://phabricator.wikimedia.org/T309492#10814993 (10Aklapper) 05Open→03Declined Unfortunately closing this Phabricator task as no further information has been provided. @grin: If this s... [07:38:30] (03CR) 10Brouberol: [C:04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [07:38:53] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [07:39:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-codfw and Arelion (2001:2035:0:af4::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [07:40:45] (03CR) 10Arnaudb: [C:03+2] gerrit: enable bacula backups on gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/1140506 (https://phabricator.wikimedia.org/T393034) (owner: 10Dzahn) [07:40:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:41:50] (03PS1) 10Klausman: 
site/preseed: Add soon-to-arrive ML hosts [puppet] - 10https://gerrit.wikimedia.org/r/1145086 (https://phabricator.wikimedia.org/T393948) [07:42:18] (03CR) 10Brouberol: "Nice! You only need to bump the chart version and you should be gtg!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene) [07:42:51] (03PS2) 10Klausman: site/preseed: Add soon-to-arrive ML hosts [puppet] - 10https://gerrit.wikimedia.org/r/1145086 (https://phabricator.wikimedia.org/T393948) [07:43:24] (03CR) 10Brouberol: [C:03+1] "Aah, this is where it comes from! Nice find" [puppet] - 10https://gerrit.wikimedia.org/r/1144583 (https://phabricator.wikimedia.org/T384322) (owner: 10Btullis) [07:44:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q4:rack/setup/install ml-serve101[23] - https://phabricator.wikimedia.org/T393948#10815008 (10klausman) a:05klausman→03None [07:47:06] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:47:10] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:47:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-codfw and Arelion (2001:2035:0:af4::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [07:48:22] ayounsi@cumin1002 update-extras (PID 1755243) is awaiting input [07:48:22] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:48:54] PROBLEM - Juniper alarms on cr1-codfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.192 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [07:49:30] (03CR) 10Brouberol: "Wait, isn't that a different patch, related to HTML enterprise?" [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [07:51:18] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 130, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:51:44] RECOVERY - Juniper alarms on cr1-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [07:52:10] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5529/console" [puppet] - 10https://gerrit.wikimedia.org/r/1145086 (https://phabricator.wikimedia.org/T393948) (owner: 10Klausman) [07:52:43] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [07:54:00] !log imported python-wmflib 1.3.1+deb13u1 to trixie-wikimedia T391083 [07:54:01] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5531/co" [puppet] - 10https://gerrit.wikimedia.org/r/1145086 (https://phabricator.wikimedia.org/T393948) (owner: 10Klausman) [07:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:03] T391083: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083 [07:54:24] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with 
url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7f3ee64c4ed0: Failed to establish a new connection: [Errno 113 [07:54:24] te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration [07:55:24] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 34, number_of_data_nodes: 34, discovered_master: True, active_primary_shards: 1708, active_shards: 5112, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 11, delayed_unassigned_shards [07:55:24] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.78528206129221 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:55:38] !log delete all unterminated cables - T393188 [07:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:41] T393188: Netbox: unterminated cables - https://phabricator.wikimedia.org/T393188 [07:56:17] (03CR) 10Muehlenhoff: [C:03+2] Stop installing prometheus-ethtool-exporter on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1143703 (https://phabricator.wikimedia.org/T371375) (owner: 10Muehlenhoff) [07:58:01] (03CR) 10Cathal Mooney: [C:03+1] Interface: add validator for child + non-virtual [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144593 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [08:00:05] jnuche and jeena: #bothumor I � Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T0800). 
[08:00:24] morning, train is currently blocked on T393992 [08:00:24] T393992: Fatal error: Uncaught Error: Class "WmfConfig" not found in /srv/mediawiki-staging/multiversion/bin/expanddblist:12 - https://phabricator.wikimedia.org/T393992 [08:01:24] @Krinkle, tgr_: you seem to be the best folks to ping about that ^ [08:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:02:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:03:16] (03PS2) 10Majavah: openstack: Use IPv6 dualstack network for image creation [puppet] - 10https://gerrit.wikimedia.org/r/1142546 [08:03:16] (03PS1) 10Majavah: P:openstack: keystone: Update ACLs for cloud-private v6 [puppet] - 10https://gerrit.wikimedia.org/r/1145093 (https://phabricator.wikimedia.org/T379283) [08:03:18] (03PS1) 10Majavah: P:openstack: rabbitmq: Add cloud-private v6 nets to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1145094 (https://phabricator.wikimedia.org/T379283) [08:04:07] !log imported python-wmflib 1.3.1+deb13u1 to trixie-wikimedia T391083 [08:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:11] T391083: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083 [08:05:26] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5532/co" [puppet] - 10https://gerrit.wikimedia.org/r/1145093 (https://phabricator.wikimedia.org/T379283) (owner: 10Majavah) [08:06:33] 
(03PS2) 10Majavah: P:openstack: rabbitmq: Add cloud-private v6 nets to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1145094 (https://phabricator.wikimedia.org/T379283) [08:07:42] (03PS1) 10D3r1ck01: SUL3: Fix account creation by username & email (with temp password) [extensions/CentralAuth] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1145096 (https://phabricator.wikimedia.org/T390751) [08:08:05] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5534/co" [puppet] - 10https://gerrit.wikimedia.org/r/1145094 (https://phabricator.wikimedia.org/T379283) (owner: 10Majavah) [08:10:20] (03PS1) 10Volans: Add support for Python 3.13 and Debian Trixie [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1145097 (https://phabricator.wikimedia.org/T391083) [08:11:56] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 3856 [08:12:03] !log copied prometheus-rsyslog-exporter 1.0.0+git20221110-1 from bookworm-wikimedia to trixie-wikimedia T391083 [08:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:06] T391083: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083 [08:13:05] (03PS1) 10Vgutierrez: Revert^2 "Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1145098 (https://phabricator.wikimedia.org/T393911) [08:13:29] (03CR) 10Volans: "Tested on python 3.13 locally, doing some more extensive tests on sretest1001 right now." 
[software/pywmflib] - 10https://gerrit.wikimedia.org/r/1145097 (https://phabricator.wikimedia.org/T391083) (owner: 10Volans) [08:13:44] (03CR) 10Gkyziridis: ml-inference-services: edit-check experirmental prod deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [08:13:47] (03PS2) 10Vgutierrez: Revert^2 "Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1145098 (https://phabricator.wikimedia.org/T393911) [08:15:19] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1142546 (owner: 10Majavah) [08:15:19] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145098 (https://phabricator.wikimedia.org/T393911) (owner: 10Vgutierrez) [08:16:32] ayounsi@cumin1002 peering (PID 1758866) is awaiting input [08:16:43] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [08:17:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-codfw and Arelion (2001:2035:0:af4::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:17:59] (03PS1) 10Cathal Mooney: Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts [puppet] - 10https://gerrit.wikimedia.org/r/1145099 (https://phabricator.wikimedia.org/T393911) [08:18:08] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:18:10] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:19:20] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145099 
(https://phabricator.wikimedia.org/T393911) (owner: 10Cathal Mooney) [08:20:27] (03CR) 10Majavah: [C:03+2] openstack: Use IPv6 dualstack network for image creation [puppet] - 10https://gerrit.wikimedia.org/r/1142546 (owner: 10Majavah) [08:21:19] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3856 [08:22:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:22:18] jouncebot: next [08:22:19] In 1 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T1000) [08:22:37] (03CR) 10Ayounsi: [C:04-1] "Not ready for prime-time, requires T393996 and a matching pfw BGP alert." [puppet] - 10https://gerrit.wikimedia.org/r/1144650 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [08:23:41] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr2-drmrs) - https://phabricator.wikimedia.org/T393991#10815152 (10ayounsi) a:03ayounsi sent an email to PCH [08:28:30] 06SRE, 10Observability-Metrics: Set a predefined time window in Pyrra's configuration to measure SLOs with - https://phabricator.wikimedia.org/T393796#10815188 (10elukey) >>! In T393796#10813932, @herron wrote: >> 2. In the Pyrra Grafana dashboards that are exported. Ideally we'd want to avoid setting the time... 
[08:28:31] (03PS1) 10Muehlenhoff: Stop installing dstat on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1145100 (https://phabricator.wikimedia.org/T391083) [08:29:24] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7fd5f19fee10: Failed to establish a new connection: [Errno 113 [08:29:24] te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration [08:30:24] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 34, number_of_data_nodes: 34, discovered_master: True, active_primary_shards: 1708, active_shards: 5112, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 11, delayed_unassigned_shards [08:30:24] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.78528206129221 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:31:40] !log pfw1-codfw - delete specific system-services in favor of "any-service" T390052 [08:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:43] T390052: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052 [08:32:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145100 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff) [08:32:17] (03PS1) 10Zabe: expanddblist: Add missing use statement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145101 (https://phabricator.wikimedia.org/T393992) 
[08:32:34] (03CR) 10Elukey: [C:03+1] Remove Turnilo dependency on netops:monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1144644 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [08:32:57] (03CR) 10Hashar: [C:03+1] "Ohhhh that is clearer now. Indeed I don't think there is any need for statsd anymore after your team has:" [puppet] - 10https://gerrit.wikimedia.org/r/1144553 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [08:33:09] (03PS2) 10Cathal Mooney: Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts [puppet] - 10https://gerrit.wikimedia.org/r/1145099 (https://phabricator.wikimedia.org/T393911) [08:33:34] (03CR) 10Elukey: [C:03+1] Add support for Python 3.13 and Debian Trixie [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1145097 (https://phabricator.wikimedia.org/T391083) (owner: 10Volans) [08:33:45] (03CR) 10Ilias Sarantopoulos: [C:03+1] admin: Add bwojtowicz to ML-related accesses [puppet] - 10https://gerrit.wikimedia.org/r/1144649 (https://phabricator.wikimedia.org/T393595) (owner: 10BCornwall) [08:34:01] (03PS3) 10Vgutierrez: Revert^2 "Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1145098 (https://phabricator.wikimedia.org/T393911) [08:34:30] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145099 (https://phabricator.wikimedia.org/T393911) (owner: 10Cathal Mooney) [08:34:35] !log pfw1-eqiad - delete specific system-services in favor of "any-service" T390052 [08:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:38] (03CR) 10Volans: [C:03+1] "LGTM, I think there is also a grant file to be updated, not sure if that needs to be done at a later stage though" [puppet] - 10https://gerrit.wikimedia.org/r/1145085 (https://phabricator.wikimedia.org/T393990) (owner: 10Muehlenhoff) [08:35:17] (03CR) 10Vgutierrez: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1145098 (https://phabricator.wikimedia.org/T393911) (owner: 10Vgutierrez) [08:35:22] !log bounce thanos-query on titan1* [08:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:53] (03CR) 10Hashar: [C:03+1] "I guess that is due to I0dd980fdb946cc82a20d941f52b1eea5f9ebfe2e . I also wish we had Phan on that repo but that is a different topic! :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145101 (https://phabricator.wikimedia.org/T393992) (owner: 10Zabe) [08:37:19] (03CR) 10Volans: [C:03+2] Add support for Python 3.13 and Debian Trixie [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1145097 (https://phabricator.wikimedia.org/T391083) (owner: 10Volans) [08:37:25] (03CR) 10Filippo Giunchedi: [C:03+2] zuul: disable statsd_exporter relaying to graphite [puppet] - 10https://gerrit.wikimedia.org/r/1144553 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [08:38:10] (03CR) 10Muehlenhoff: "Yeah, that's https://gerrit.wikimedia.org/r/c/operations/puppet/+/1145043 I'll merge this patch after grants are deployed."
[puppet] - 10https://gerrit.wikimedia.org/r/1145085 (https://phabricator.wikimedia.org/T393990) (owner: 10Muehlenhoff) [08:38:40] (03CR) 10Zabe: [C:03+2] expanddblist: Add missing use statement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145101 (https://phabricator.wikimedia.org/T393992) (owner: 10Zabe) [08:39:02] (03Abandoned) 10Vgutierrez: Revert^2 "Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1145098 (https://phabricator.wikimedia.org/T393911) (owner: 10Vgutierrez) [08:39:30] (03Merged) 10jenkins-bot: expanddblist: Add missing use statement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145101 (https://phabricator.wikimedia.org/T393992) (owner: 10Zabe) [08:39:35] (03CR) 10Vgutierrez: [C:03+1] "nice catch" [puppet] - 10https://gerrit.wikimedia.org/r/1145099 (https://phabricator.wikimedia.org/T393911) (owner: 10Cathal Mooney) [08:40:19] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1145101|expanddblist: Add missing use statement (T393992)]] [08:40:22] T393992: Fatal error: Uncaught Error: Class "WmfConfig" not found in /srv/mediawiki-staging/multiversion/bin/expanddblist:12 - https://phabricator.wikimedia.org/T393992 [08:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:42:48] (03Merged) 10jenkins-bot: Add support for Python 3.13 and Debian Trixie [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1145097 (https://phabricator.wikimedia.org/T391083) (owner: 10Volans) [08:45:05] !log zabe@deploy1003 zabe: Backport for [[gerrit:1145101|expanddblist: Add missing use statement (T393992)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:45:24] !log 
zabe@deploy1003 zabe: Continuing with sync [08:46:27] (03CR) 10Filippo Giunchedi: [C:03+1] Stop installing dstat on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1145100 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff) [08:46:56] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:49:50] (03CR) 10Ayounsi: [C:03+2] Remove Turnilo dependency on netops:monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1144644 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [08:51:10] (03CR) 10Hashar: [C:03+1] [BETA CLUSTER] Close en_rtlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140976 (owner: 10Jforrester) [08:51:20] (03CR) 10JMeybohm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141282 (owner: 10Effie Mouzeli) [08:52:08] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1145101|expanddblist: Add missing use statement (T393992)]] (duration: 11m 48s) [08:52:11] T393992: Fatal error: Uncaught Error: Class "WmfConfig" not found in /srv/mediawiki-staging/multiversion/bin/expanddblist:12 - https://phabricator.wikimedia.org/T393992 [08:56:03] (03PS3) 10Cathal Mooney: lvs: add eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts [puppet] - 10https://gerrit.wikimedia.org/r/1145099 (https://phabricator.wikimedia.org/T393911) [08:58:00] (03CR) 10Alexandros Kosiaris: [C:03+1] Stop installing dstat on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1145100 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff) [08:59:07] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145102 (https://phabricator.wikimedia.org/T392171) [08:59:09] (03CR) 10TrainBranchBot: [C:03+2] group0 to 
1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145102 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot) [08:59:55] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145102 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot) [08:59:56] (03CR) 10Vgutierrez: [C:03+2] lvs: add eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts [puppet] - 10https://gerrit.wikimedia.org/r/1145099 (https://phabricator.wikimedia.org/T393911) (owner: 10Cathal Mooney) [09:00:46] !log rolling reboot of eqiad load balancers to add E8/F8 interfaces - T393911 | T382017 [09:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:51] T393911: Figure out why OpenSearch operational scripts frequently fail to connect - https://phabricator.wikimedia.org/T393911 [09:00:51] T382017: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017 [09:00:58] (03CR) 10Alexandros Kosiaris: [C:03+1] mcrouter: update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141282 (owner: 10Effie Mouzeli) [09:01:32] (03PS3) 10Gkyziridis: ml-inference-services: edit-check experimental prod deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) [09:02:28] (03CR) 10CI reject: [V:04-1] ml-inference-services: edit-check experimental prod deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [09:06:25] (03CR) 10Lucas Werkmeister (WMDE): "I don’t think so (due to the `-q0`), but if you know another port that already has the DROP policy, we can try it out…" [puppet] - 10https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [09:06:36] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:06:54] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:08:12] PROBLEM - pybal on lvs1020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [09:08:18] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs1020.eqiad.wmnet [09:11:24] (03PS7) 10Stevemunene: airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) [09:11:36] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1020.eqiad.wmnet [09:11:39] !log jnuche@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.1 refs T392171 [09:11:41] T392171: 1.45.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T392171 [09:12:12] PROBLEM - pybal on lvs1020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [09:12:15] (03CR) 10CI reject: [V:04-1] airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene) [09:12:34] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [09:13:36] (03CR) 10Ilias Sarantopoulos: ml-inference-services: edit-check experimental prod deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [09:13:41] (03CR) 10Ilias Sarantopoulos: [C:03+1] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) 
(owner: 10Gkyziridis) [09:13:50] (03CR) 10Gkyziridis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [09:14:03] FIRING: KubernetesAPILatency: High Kubernetes API latency (PUT leases) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:14:08] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [09:14:12] RECOVERY - pybal on lvs1020 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [09:14:34] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:14:57] !log installing nginx security updates [09:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:09] (03PS4) 10Gkyziridis: ml-inference-services: edit-check experimental prod deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) [09:16:42] (03CR) 10Alexandros Kosiaris: [C:03+1] apertium: upgrade to mesh:configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144455 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [09:16:58] (03CR) 10Alexandros Kosiaris: [C:03+1] cxserver: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144464 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [09:17:11] (03CR) 10Alexandros Kosiaris: [C:03+1] developer-portal: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144467 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) 
[09:17:24] (03CR) 10CI reject: [V:04-1] ml-inference-services: edit-check experimental prod deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [09:17:36] (03CR) 10Alexandros Kosiaris: [C:03+1] function-evaluator: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144472 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [09:18:06] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [09:20:20] (03CR) 10Muehlenhoff: [C:03+2] Stop installing dstat on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1145100 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff) [09:23:01] (03Abandoned) 10Slyngshede: Alternative SSH key management [software/bitu] - 10https://gerrit.wikimedia.org/r/1113472 (owner: 10Slyngshede) [09:23:21] (03Abandoned) 10Slyngshede: Implement dialog for requesting permission [software/bitu] - 10https://gerrit.wikimedia.org/r/1113471 (owner: 10Slyngshede) [09:23:36] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:23:43] ^^ that's me [09:23:48] ack [09:23:54] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:24:03] RESOLVED: [2x] KubernetesAPILatency: High Kubernetes API latency (GET configmaps) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:25:38] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [09:26:18] PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 
processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [09:27:46] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=83) https://wikitech.wikimedia.org/wiki/PyBal [09:27:46] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service is not active. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [09:28:46] (03CR) 10Vgutierrez: [C:03+1] "quick check using lua CLI shows that regex work as expected:" [puppet] - 10https://gerrit.wikimedia.org/r/1144581 (https://phabricator.wikimedia.org/T393591) (owner: 10Hnowlan) [09:30:50] (03PS8) 10Stevemunene: airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) [09:31:43] (03CR) 10CI reject: [V:04-1] airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene) [09:34:02] (03CR) 10Elukey: [C:03+2] apertium: upgrade to mesh:configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144455 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [09:34:39] (03PS1) 10Majavah: hieradata: Upgrade codfw1dev horizon to 2025-05-13-092920 [puppet] - 10https://gerrit.wikimedia.org/r/1145107 [09:35:15] 06SRE, 06Infrastructure-Foundations: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10815443 (10ops-monitoring-bot) Test comment [09:36:45] (03CR) 10Majavah: [C:03+2] hieradata: Upgrade codfw1dev horizon to 2025-05-13-092920 [puppet] - 10https://gerrit.wikimedia.org/r/1145107 (owner: 10Majavah) [09:38:02] !log imported confd 0.16.0-1+deb13u0 to trixie-wikimedia T391083 [09:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[09:38:05] T391083: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083 [09:38:20] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-staging-ctrl_6443: Servers ml-staging-ctrl2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:39:20] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:41:02] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs1019.eqiad.wmnet [09:41:47] (03PS1) 10Majavah: hieradata: Upgrade eqiad1 Horizon to 2025-05-13-092920 [puppet] - 10https://gerrit.wikimedia.org/r/1145109 [09:42:45] (03CR) 10Hnowlan: [C:03+2] trafficserver: route a smaller subset of enwiki pcs pages without restbase [puppet] - 10https://gerrit.wikimedia.org/r/1144581 (https://phabricator.wikimedia.org/T393591) (owner: 10Hnowlan) [09:43:19] jouncebot: nowandnext [09:43:19] For the next 0 hour(s) and 16 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T0800) [09:43:19] In 0 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T1000) [09:43:29] (03CR) 10Majavah: [C:03+2] hieradata: Upgrade eqiad1 Horizon to 2025-05-13-092920 [puppet] - 10https://gerrit.wikimedia.org/r/1145109 (owner: 10Majavah) [09:44:04] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1019.eqiad.wmnet [09:44:18] PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [09:44:38] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal 
[09:46:16] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (gerrit2002), No backups: 1 (gerrit2002), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:47:46] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=83) https://wikitech.wikimedia.org/wiki/PyBal [09:48:18] RECOVERY - pybal on lvs1019 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [09:48:38] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:49:24] !log installing wget security updates [09:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:52] (03PS1) 10Ayounsi: Add alerting for eqiad pfw BGP core sessions [alerts] - 10https://gerrit.wikimedia.org/r/1145115 (https://phabricator.wikimedia.org/T388641) [09:50:58] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[09:51:05] (03CR) 10CI reject: [V:04-1] Add alerting for eqiad pfw BGP core sessions [alerts] - 10https://gerrit.wikimedia.org/r/1145115 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [09:51:46] !log Route all PCS calls for enwiki articles starting with A via rest-gateway and without restbase [09:51:47] (03PS1) 10Gkyziridis: testing-testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145116 (https://phabricator.wikimedia.org/T0000) [09:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:10] (03Abandoned) 10Gkyziridis: testing-testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145116 (https://phabricator.wikimedia.org/T0000) (owner: 10Gkyziridis) [09:52:46] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 83 connections established with conf1007.eqiad.wmnet:4001 (min=83) https://wikitech.wikimedia.org/wiki/PyBal [09:53:08] (03PS1) 10Gkyziridis: testing-testing ADASF Bug: T0000 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145117 (https://phabricator.wikimedia.org/T0000) [09:53:56] (03PS2) 10Ayounsi: Add alerting for eqiad pfw BGP core sessions [alerts] - 10https://gerrit.wikimedia.org/r/1145115 (https://phabricator.wikimedia.org/T388641) [09:54:00] (03CR) 10CI reject: [V:04-1] testing-testing ADASF Bug: T0000 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145117 (https://phabricator.wikimedia.org/T0000) (owner: 10Gkyziridis) [09:55:53] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[09:55:54] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:56:36] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:56:53] (03PS1) 10Gkyziridis: testing-testing ADASF Bug: T0000 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145119 (https://phabricator.wikimedia.org/T0000) [09:57:46] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [09:57:48] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [09:57:48] PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [09:57:51] (03CR) 10CI reject: [V:04-1] testing-testing ADASF Bug: T0000 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145119 (https://phabricator.wikimedia.org/T0000) (owner: 10Gkyziridis) [09:58:28] 07Puppet, 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Puppet should prune stale entries from sudoers.d - https://phabricator.wikimedia.org/T309268#10815486 (10taavi) 05Resolved→03Open a:05jbond→03None Re-opening as the `purge_sudoers_d` flag was never actually enabled. 
[09:58:28] 06SRE, 06Infrastructure-Foundations: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10815489 (10MoritzMuehlenhoff) There's some augeas-related output spam on Puppet runs, already reported as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1098696 [09:59:01] FIRING: HelmReleaseBadStatus: Helm release mw-web/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:59:04] (03PS1) 10Gkyziridis: testing-testing ADASF Bug: T00aeasd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145120 (https://phabricator.wikimedia.org/T00) [09:59:20] (03Abandoned) 10Gkyziridis: testing-testing ADASF Bug: T00aeasd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145120 (https://phabricator.wikimedia.org/T00) (owner: 10Gkyziridis) [10:00:00] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=18) https://wikitech.wikimedia.org/wiki/PyBal [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T1000) [10:00:28] PSA: train is still running [10:01:56] RESOLVED: HelmReleaseBadStatus: Helm release mw-web/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:04:01] FIRING: [2x] HelmReleaseBadStatus: Helm release mw-web/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:04:37] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs1018.eqiad.wmnet [10:06:07] (03PS1) 10Majavah: P:puppet: locate-unmanaged: Sort entries before printing [puppet] - 10https://gerrit.wikimedia.org/r/1145121 [10:06:14] (03PS1) 10Hnowlan: trafficserver: enwiki regex for restbaseless routing: A-I [puppet] - 10https://gerrit.wikimedia.org/r/1145122 (https://phabricator.wikimedia.org/T393591) [10:06:15] (03PS1) 10Hnowlan: trafficserver: enwiki regex for restbaseless routing: numeric [puppet] - 10https://gerrit.wikimedia.org/r/1145123 (https://phabricator.wikimedia.org/T393591) [10:06:16] (03PS1) 10Hnowlan: trafficserver: enwiki regex for restbaseless routing: lowercase [puppet] - 10https://gerrit.wikimedia.org/r/1145124 (https://phabricator.wikimedia.org/T393591) [10:06:16] (03Abandoned) 10Gkyziridis: testing-testing ADASF Bug: T0000 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145119 (https://phabricator.wikimedia.org/T0000) (owner: 10Gkyziridis) [10:06:17] (03PS1) 10Hnowlan: trafficserver: enwiki regex for restbaseless routing: all pages [puppet] - 10https://gerrit.wikimedia.org/r/1145125 (https://phabricator.wikimedia.org/T393591) [10:06:40] (03Abandoned) 10Gkyziridis: testing-testing ADASF Bug: T0000 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145117 (https://phabricator.wikimedia.org/T0000) (owner: 10Gkyziridis) [10:06:56] RESOLVED: [2x] HelmReleaseBadStatus: Helm release mw-web/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:06:56] (03PS5) 10Gkyziridis: ml-inference-services: edit-check experimental prod deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) [10:07:25] (03PS2) 10Hnowlan: trafficserver: enwiki regex for 
restbaseless routing: A-I [puppet] - 10https://gerrit.wikimedia.org/r/1145122 (https://phabricator.wikimedia.org/T393591) [10:07:51] (03CR) 10CI reject: [V:04-1] ml-inference-services: edit-check experimental prod deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [10:07:56] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1018.eqiad.wmnet [10:08:48] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [10:08:48] PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [10:09:48] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:09:48] RECOVERY - pybal on lvs1018 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [10:10:00] RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 18 connections established with conf1007.eqiad.wmnet:4001 (min=18) https://wikitech.wikimedia.org/wiki/PyBal [10:10:46] (03CR) 10Jgiannelos: [C:03+1] trafficserver: enwiki regex for restbaseless routing: A-I [puppet] - 10https://gerrit.wikimedia.org/r/1145122 (https://phabricator.wikimedia.org/T393591) (owner: 10Hnowlan) [10:11:49] (03CR) 10Btullis: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1145086 (https://phabricator.wikimedia.org/T393948) (owner: 10Klausman) [10:13:16] (03CR) 10Hnowlan: [C:03+2] trafficserver: enwiki regex for restbaseless routing: A-I [puppet] - 10https://gerrit.wikimedia.org/r/1145122 (https://phabricator.wikimedia.org/T393591) (owner: 10Hnowlan) [10:14:08] (03CR) 10Ayounsi: "could/should we make those alert the frack team as well?" [alerts] - 10https://gerrit.wikimedia.org/r/1145115 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [10:14:12] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [10:14:54] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.1 refs T392171 [10:14:58] T392171: 1.45.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T392171 [10:16:06] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [10:16:12] (03PS1) 10Federico Ceratto: zarcillo: values.yaml: Fix typo, remove comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145127 (https://phabricator.wikimedia.org/T384212) [10:16:12] (03CR) 10Federico Ceratto: "As discussed on IRC" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145127 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [10:17:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jnuche@deploy1003 using scap backport" [extensions/DiscussionTools] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1144715 (https://phabricator.wikimedia.org/T393983) (owner: 10Bartosz Dziewoński) [10:17:54] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:17:55] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 06Release-Engineering-Team, 10vm-requests: codfw: 1VM request for zuul3+ - 
https://phabricator.wikimedia.org/T393873#10815565 (10MoritzMuehlenhoff) If we can pick freely, then let's use codfw/C. [10:18:29] (03Merged) 10jenkins-bot: Update for Parsoid's rename of XMLSerializer to XHtmlSerializer [extensions/DiscussionTools] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1144715 (https://phabricator.wikimedia.org/T393983) (owner: 10Bartosz Dziewoński) [10:18:36] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:18:58] !log jnuche@deploy1003 Started scap sync-world: Backport for [[gerrit:1144715|Update for Parsoid's rename of XMLSerializer to XHtmlSerializer (T393983)]] [10:19:01] T393983: `Error: Class "Wikimedia\Parsoid\Wt2Html\XMLSerializer" not found` in PHPUnit tests - https://phabricator.wikimedia.org/T393983 [10:19:03] FIRING: KubernetesAPILatency: High Kubernetes API latency (GET namespaces) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:19:46] PROBLEM - pybal on lvs1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [10:20:32] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [10:20:56] (03CR) 10Volans: P:puppet: locate-unmanaged: Sort entries before printing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1145121 (owner: 10Majavah) [10:21:54] (03PS2) 10Majavah: P:puppet: locate-unmanaged: Sort entries before printing [puppet] - 10https://gerrit.wikimedia.org/r/1145121 [10:22:06] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 
connections established with conf1007.eqiad.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [10:22:06] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1017 is CRITICAL: CRITICAL: Service pybal.service is not active. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [10:22:16] (03CR) 10Majavah: P:puppet: locate-unmanaged: Sort entries before printing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1145121 (owner: 10Majavah) [10:22:58] (03PS1) 10JMeybohm: Refactor sre.discovery's use of resolve_with_client_ip [cookbooks] - 10https://gerrit.wikimedia.org/r/1145129 (https://phabricator.wikimedia.org/T393600) [10:23:17] (03PS1) 10Volans: docstrings: update examples [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1145130 [10:23:48] (03CR) 10Clément Goubert: [C:03+1] zarcillo: values.yaml: Fix typo, remove comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145127 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [10:24:03] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (GET namespaces) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:24:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1028 es2027 T391921', diff saved to https://phabricator.wikimedia.org/P75988 and previous config saved to /var/cache/conftool/dbconfig/20250513-102455-marostegui.json [10:24:58] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921 [10:25:19] (03PS1) 10Marostegui: es1028: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1145132 (https://phabricator.wikimedia.org/T391921) [10:25:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1:00:00 on es2027.codfw.wmnet,es1028.eqiad.wmnet with reason: Maintenance [10:26:19] !log jnuche@deploy1003 matmarex, jnuche: Backport for [[gerrit:1144715|Update for Parsoid's rename of XMLSerializer to XHtmlSerializer (T393983)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:26:21] T393983: `Error: Class "Wikimedia\Parsoid\Wt2Html\XMLSerializer" not found` in PHPUnit tests - https://phabricator.wikimedia.org/T393983 [10:26:27] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:26:29] !log jnuche@deploy1003 matmarex, jnuche: Continuing with sync [10:27:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [10:27:38] (03CR) 10Volans: [C:03+1] "LGTM, thx" [puppet] - 10https://gerrit.wikimedia.org/r/1145121 (owner: 10Majavah) [10:29:32] (03CR) 10Majavah: [C:03+2] P:puppet: locate-unmanaged: Sort entries before printing [puppet] - 10https://gerrit.wikimedia.org/r/1145121 (owner: 10Majavah) [10:31:05] (03PS2) 10Hnowlan: trafficserver: enwiki regex for restbaseless routing: numeric [puppet] - 10https://gerrit.wikimedia.org/r/1145123 (https://phabricator.wikimedia.org/T393591) [10:32:02] Hey folks \o [10:32:02] I had created a patch for review on gerrit on the deployment-charts: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1144521 . [10:32:02] The tests were succeeding when suddenly after a small change in the commit-message the tests are failing throwing these in Jenkins: [10:32:02] - `rake aborted!` [10:32:02] - `NoMethodError: undefined method filter!' 
for true:TrueClass` [10:32:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [10:32:24] (03CR) 10Marostegui: [C:03+2] es1028: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1145132 (https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui) [10:32:30] Does anyone know what are these ? ^^ [10:33:11] (03CR) 10Jgiannelos: [C:03+1] trafficserver: enwiki regex for restbaseless routing: numeric [puppet] - 10https://gerrit.wikimedia.org/r/1145123 (https://phabricator.wikimedia.org/T393591) (owner: 10Hnowlan) [10:33:53] georgekyz: probably introduced by a CI change. We're currently investigating (cc akosiaris) [10:34:43] (03PS1) 10Marostegui: es2027: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1145135 (https://phabricator.wikimedia.org/T391921) [10:35:37] !log jnuche@deploy1003 Finished scap sync-world: Backport for [[gerrit:1144715|Update for Parsoid's rename of XMLSerializer to XHtmlSerializer (T393983)]] (duration: 16m 38s) [10:35:40] T393983: `Error: Class "Wikimedia\Parsoid\Wt2Html\XMLSerializer" not found` in PHPUnit tests - https://phabricator.wikimedia.org/T393983 [10:35:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75989 and previous config saved to /var/cache/conftool/dbconfig/20250513-103548-root.json [10:35:50] (03CR) 10Hnowlan: [C:03+2] trafficserver: enwiki regex for restbaseless routing: numeric [puppet] - 10https://gerrit.wikimedia.org/r/1145123 (https://phabricator.wikimedia.org/T393591) (owner: 10Hnowlan) [10:35:54] (03CR) 10Marostegui: [C:03+2] es2027: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1145135 
(https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui) [10:36:30] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:36:58] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:37:48] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:38:08] !log jayme@cumin1002 START - Cookbook sre.discovery.datacenter [10:38:09] !log jayme@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) [10:38:11] (03Abandoned) 10Slyngshede: C:raid:perccli do not error out if controller is no in use [puppet] - 10https://gerrit.wikimedia.org/r/1126542 (owner: 10Slyngshede) [10:38:55] !log jayme@cumin1002 START - Cookbook sre.discovery.datacenter [10:38:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) [10:38:57] jayme: Thank you very much for the quick response. 
[10:39:09] (03CR) 10Jgiannelos: [C:03+1] trafficserver: enwiki regex for restbaseless routing: lowercase [puppet] - 10https://gerrit.wikimedia.org/r/1145124 (https://phabricator.wikimedia.org/T393591) (owner: 10Hnowlan) [10:39:16] (03PS1) 10JMeybohm: sre.discovery.datacenter: Raise CookbookInitSuccess on status action [cookbooks] - 10https://gerrit.wikimedia.org/r/1145137 [10:39:24] (03CR) 10Jgiannelos: [C:03+1] trafficserver: enwiki regex for restbaseless routing: all pages [puppet] - 10https://gerrit.wikimedia.org/r/1145125 (https://phabricator.wikimedia.org/T393591) (owner: 10Hnowlan) [10:40:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [10:40:28] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs1017.eqiad.wmnet [10:40:51] !log train finished [10:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:53] (03CR) 10Gkyziridis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [10:42:20] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 1.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:42:48] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:43:41] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1017.eqiad.wmnet [10:43:44] PROBLEM - Host lvs1017 is DOWN: PING CRITICAL - Packet loss = 100% [10:44:02] RECOVERY - Host lvs1017 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [10:44:32] PROBLEM - PyBal backends health 
check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [10:44:46] PROBLEM - pybal on lvs1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [10:45:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [10:45:08] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [10:45:32] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:45:46] RECOVERY - pybal on lvs1017 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [10:46:10] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:47:06] RECOVERY - PyBal connections to etcd on lvs1017 is OK: OK: 8 connections established with conf1007.eqiad.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [10:47:06] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [10:48:13] (03PS1) 10Filippo Giunchedi: Fix non-existent team data-platform-sre [puppet] - 10https://gerrit.wikimedia.org/r/1145138 (https://phabricator.wikimedia.org/T393858) [10:48:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75990 and previous config saved to /var/cache/conftool/dbconfig/20250513-104825-root.json [10:48:56] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:50:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75991 and previous config saved to /var/cache/conftool/dbconfig/20250513-105053-root.json [10:51:28] (03CR) 10Filippo Giunchedi: "Try e.g. 2004/tcp" [puppet] - 10https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [10:51:51] (03CR) 10Filippo Giunchedi: "My bad, I mean 2006/tcp" [puppet] - 10https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [10:52:06] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1017 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [10:53:59] (03CR) 10Lucas Werkmeister (WMDE): "Works fine, `nc` just silently exits nonzero immediately:" [puppet] - 10https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [10:54:03] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (PATCH events) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:54:37] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10815724 (10cmooney) Link remains stable that I can see, there are no errors reported in either the switch or host side stats. For the record the device is a BCM57414 NIC, in PCIe... [10:56:10] (03CR) 10Brouberol: airflow: cleanup deployment charts (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene) [10:56:17] (03CR) 10Volans: [C:04-1] "Approach looks ok, I think there are minor nits to fix." [cookbooks] - 10https://gerrit.wikimedia.org/r/1145129 (https://phabricator.wikimedia.org/T393600) (owner: 10JMeybohm) [10:56:38] (03PS2) 10Hnowlan: trafficserver: enwiki regex for restbaseless routing: lowercase [puppet] - 10https://gerrit.wikimedia.org/r/1145124 (https://phabricator.wikimedia.org/T393591) [10:57:11] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10815745 (10Ifrahkhanyaree_WMDE) 05Open→03Resolved Closing the ticket as all t... 
[10:57:23] (03CR) 10Volans: [C:03+1] "LGTM, thanks for using the new feature!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1145137 (owner: 10JMeybohm) [10:57:44] (03CR) 10Volans: [C:03+2] docstrings: update examples [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1145130 (owner: 10Volans) [10:58:01] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:59:03] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (PATCH events) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:59:11] (03CR) 10Filippo Giunchedi: "localhost is not going to be a reliable test, this hangs for me:" [puppet] - 10https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [11:01:01] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1145115 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:02:20] (03CR) 10Filippo Giunchedi: "We could route based on (additional) labels for sure (team is single-team) or something in the alert name or instance=~^pfw and scope=netw" [alerts] - 10https://gerrit.wikimedia.org/r/1145115 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:02:23] (03Merged) 10jenkins-bot: docstrings: update examples [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1145130 (owner: 10Volans) [11:03:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75992 and previous config saved to /var/cache/conftool/dbconfig/20250513-110330-root.json [11:04:03] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH events) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:04:04] (03CR) 10Clément Goubert: [V:03+2 C:03+2] zarcillo: values.yaml: Fix typo, remove comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145127 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [11:04:56] (03CR) 10CI reject: [V:04-1] zarcillo: values.yaml: Fix typo, remove comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145127 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [11:05:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [11:06:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 
(re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75993 and previous config saved to /var/cache/conftool/dbconfig/20250513-110559-root.json [11:06:19] (03CR) 10Lucas Werkmeister (WMDE): "Good point, added: I5d921fbc9f" [puppet] - 10https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [11:07:51] (03PS1) 10Volans: CHANGELOG: add changelogs for release v1.3.2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1145146 [11:07:57] (03CR) 10Marostegui: [C:03+1] Add cumin1003 as mysql root client [puppet] - 10https://gerrit.wikimedia.org/r/1145085 (https://phabricator.wikimedia.org/T393990) (owner: 10Muehlenhoff) [11:10:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [11:10:49] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v1.3.2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1145146 (owner: 10Volans) [11:12:13] (03CR) 10Cathal Mooney: [C:03+1] "Yes this one primarily needs to be visible by netops/IF. But it definitely would be nice if frtech could also be made aware I think." 
[alerts] - 10https://gerrit.wikimedia.org/r/1145115 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:14:08] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [11:15:35] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.3.2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1145146 (owner: 10Volans) [11:16:06] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [11:16:59] (03CR) 10Klausman: [V:03+1 C:03+2] site/preseed: Add soon-to-arrive ML hosts [puppet] - 10https://gerrit.wikimedia.org/r/1145086 (https://phabricator.wikimedia.org/T393948) (owner: 10Klausman) [11:18:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75994 and previous config saved to /var/cache/conftool/dbconfig/20250513-111836-root.json [11:21:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75995 and previous config saved to /var/cache/conftool/dbconfig/20250513-112104-root.json [11:21:23] (03PS1) 10Marostegui: instances: Add db2241, db2242 [puppet] - 10https://gerrit.wikimedia.org/r/1145151 (https://phabricator.wikimedia.org/T390530) [11:23:01] (03CR) 10Marostegui: [C:03+2] instances: Add db2241, db2242 [puppet] - 10https://gerrit.wikimedia.org/r/1145151 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [11:27:36] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [11:31:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db2241 and db2242 future x3 hosts, 
to s8 T390530', diff saved to https://phabricator.wikimedia.org/P75996 and previous config saved to /var/cache/conftool/dbconfig/20250513-113138-marostegui.json [11:31:42] T390530: Create topology for x3 hosts - https://phabricator.wikimedia.org/T390530 [11:32:36] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [11:33:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75998 and previous config saved to /var/cache/conftool/dbconfig/20250513-113342-root.json [11:36:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75999 and previous config saved to /var/cache/conftool/dbconfig/20250513-113610-root.json [11:38:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76000 and previous config saved to /var/cache/conftool/dbconfig/20250513-113810-root.json [11:38:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76001 and previous config saved to /var/cache/conftool/dbconfig/20250513-113816-root.json [11:38:56] (03PS1) 10Marostegui: db2241,db2242: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1145165 (https://phabricator.wikimedia.org/T390530) [11:39:43] (03CR) 10Hnowlan: [C:03+2] trafficserver: enwiki regex for restbaseless routing: lowercase [puppet] - 10https://gerrit.wikimedia.org/r/1145124 (https://phabricator.wikimedia.org/T393591) (owner: 10Hnowlan) [11:39:56] jouncebot: nowandnext [11:39:56] No deployments scheduled for the next 0 hour(s) and 20 
minute(s) [11:39:56] In 0 hour(s) and 20 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T1200) [11:40:34] (03CR) 10Marostegui: [C:03+2] db2241,db2242: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1145165 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [11:40:59] !log tchin@deploy1003 Started deploy [airflow-dags/analytics@146dab1]: Deploying airflow artifacts for T384962 [11:41:02] T384962: Implement alerting for wmf_content.mediawiki_content_history_v1 - https://phabricator.wikimedia.org/T384962 [11:43:20] !log tchin@deploy1003 Finished deploy [airflow-dags/analytics@146dab1]: Deploying airflow artifacts for T384962 (duration: 02m 44s) [11:44:10] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [11:44:29] (03CR) 10Clément Goubert: [C:03+1] P:mw:maint:update_flaggedrev_stats: migrate to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1144692 (https://phabricator.wikimedia.org/T388535) (owner: 10Scott French) [11:45:06] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [11:47:45] (03PS1) 10Filippo Giunchedi: alertmanager: route network pfw alerts to fr [puppet] - 10https://gerrit.wikimedia.org/r/1145169 (https://phabricator.wikimedia.org/T388641) [11:48:32] (03CR) 10Filippo Giunchedi: [C:03+1] "SGTM, I have I148ed592a out" [alerts] - 10https://gerrit.wikimedia.org/r/1145115 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:48:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76002 and previous config saved to /var/cache/conftool/dbconfig/20250513-114847-root.json [11:51:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 50%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P76003 and previous config saved to /var/cache/conftool/dbconfig/20250513-115115-root.json [11:53:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P76005 and previous config saved to /var/cache/conftool/dbconfig/20250513-115317-root.json [11:53:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P76006 and previous config saved to /var/cache/conftool/dbconfig/20250513-115322-root.json [11:56:08] (03PS3) 10Ayounsi: Add alerting for eqiad pfw BGP core sessions [alerts] - 10https://gerrit.wikimedia.org/r/1145115 (https://phabricator.wikimedia.org/T388641) [11:57:46] (03PS4) 10Ayounsi: Add alerting for eqiad pfw BGP core sessions [alerts] - 10https://gerrit.wikimedia.org/r/1145115 (https://phabricator.wikimedia.org/T388641) [11:59:21] (03CR) 10Ayounsi: [C:03+2] Add alerting for eqiad pfw BGP core sessions [alerts] - 10https://gerrit.wikimedia.org/r/1145115 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:59:46] (03PS1) 10Marostegui: db1255: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1145171 (https://phabricator.wikimedia.org/T390530) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T1200) [12:00:34] (03Merged) 10jenkins-bot: Add alerting for eqiad pfw BGP core sessions [alerts] - 10https://gerrit.wikimedia.org/r/1145115 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:01:06] (03CR) 10Ayounsi: "Change lgtm, I'll leave it to the fr-tech team to decide if they want the alerts or not." 
[puppet] - 10https://gerrit.wikimedia.org/r/1145169 (https://phabricator.wikimedia.org/T388641) (owner: 10Filippo Giunchedi) [12:01:16] (03CR) 10Marostegui: [C:03+2] db1255: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1145171 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [12:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:29] (03PS1) 10Volans: Upstream release v1.3.2 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1145172 [12:02:40] (03CR) 10Volans: [C:03+2] Upstream release v1.3.2 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1145172 (owner: 10Volans) [12:02:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:03:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76007 and previous config saved to /var/cache/conftool/dbconfig/20250513-120352-root.json [12:04:09] (03PS1) 10Marostegui: instances.yaml: Add db1255 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1145173 (https://phabricator.wikimedia.org/T390530) [12:05:57] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10815978 (10MoritzMuehlenhoff) [12:06:02] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db1255 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1145173 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [12:06:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 
(re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76008 and previous config saved to /var/cache/conftool/dbconfig/20250513-120621-root.json [12:06:32] !log installing ucf security updates [12:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:18] (03Merged) 10jenkins-bot: Upstream release v1.3.2 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1145172 (owner: 10Volans) [12:08:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db1255 future x3 hosts, to s8 T390530', diff saved to https://phabricator.wikimedia.org/P76009 and previous config saved to /var/cache/conftool/dbconfig/20250513-120853-marostegui.json [12:08:57] T390530: Create topology for x3 hosts - https://phabricator.wikimedia.org/T390530 [12:09:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P76010 and previous config saved to /var/cache/conftool/dbconfig/20250513-120901-root.json [12:09:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P76011 and previous config saved to /var/cache/conftool/dbconfig/20250513-120902-root.json [12:10:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76012 and previous config saved to /var/cache/conftool/dbconfig/20250513-121018-root.json [12:13:37] 06SRE, 06Infrastructure-Foundations, 10netops: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021 (10cmooney) 03NEW p:05Triage→03Medium [12:14:06] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10816007 (10cmooney) [12:14:07] 06SRE, 06Infrastructure-Foundations, 10netops: Stage and configure new Juniper switches in codfw 
rows E/F - https://phabricator.wikimedia.org/T394021#10816008 (10cmooney) [12:14:37] (03CR) 10Jelto: [C:03+1] "thanks for uploading a fix! This looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1145138 (https://phabricator.wikimedia.org/T393858) (owner: 10Filippo Giunchedi) [12:18:39] !log uploaded python3-wmflib_1.3.2 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia,bookworm-wikimedia,trixie-wikimedia [12:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:43] (03PS2) 10Ayounsi: Icinga: remove some network devices checks [puppet] - 10https://gerrit.wikimedia.org/r/1144650 (https://phabricator.wikimedia.org/T388641) [12:18:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76014 and previous config saved to /var/cache/conftool/dbconfig/20250513-121858-root.json [12:19:46] (03CR) 10Alexandros Kosiaris: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144920 (https://phabricator.wikimedia.org/T389375) (owner: 10Alexandros Kosiaris) [12:21:14] 07Puppet, 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Puppet should prune stale entries from sudoers.d - https://phabricator.wikimedia.org/T309268#10816016 (10taavi) {P76015} [12:21:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76016 and previous config saved to /var/cache/conftool/dbconfig/20250513-122126-root.json [12:22:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:22:18] !log installing libapache2-mod-auth-openidc security updates [12:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 13 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1145096 (https://phabricator.wikimedia.org/T390751) (owner: 10D3r1ck01) [12:22:26] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5535/co" [puppet] - 10https://gerrit.wikimedia.org/r/1144578 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [12:23:33] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] "I have verified there's no unexpected cross-site flows between prometheus etcd, proceeding" [puppet] - 10https://gerrit.wikimedia.org/r/1144601 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [12:24:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P76017 and previous config saved to /var/cache/conftool/dbconfig/20250513-122406-root.json [12:24:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P76018 and previous config saved to /var/cache/conftool/dbconfig/20250513-122407-root.json [12:24:37] (03CR) 10Jelto: [V:03+1 C:03+1] "looks good to me in PCC and per documentation in https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Tell_the_deployment_serv" [puppet] - 10https://gerrit.wikimedia.org/r/1144578 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [12:25:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P76019 and previous config saved to /var/cache/conftool/dbconfig/20250513-122523-root.json [12:25:50] (03PS1) 10Marostegui: db1256: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1145182 (https://phabricator.wikimedia.org/T390530) [12:28:48] (03CR) 10Filippo Giunchedi: 
alertmanager: route network pfw alerts to fr (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1145169 (https://phabricator.wikimedia.org/T388641) (owner: 10Filippo Giunchedi) [12:28:51] (03PS2) 10Filippo Giunchedi: alertmanager: route network devices alerts to fr [puppet] - 10https://gerrit.wikimedia.org/r/1145169 (https://phabricator.wikimedia.org/T388641) [12:29:13] (03PS1) 10Urbanecm: [Growth] eswiki: Bump mentorship to 70% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145184 (https://phabricator.wikimedia.org/T392869) [12:30:51] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10816085 (10MoritzMuehlenhoff) [12:31:04] (03CR) 10Marostegui: [C:03+2] db1256: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1145182 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [12:31:23] !log ayounsi@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: cr3-eqsin upgrade, T364092] [12:31:27] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [12:31:28] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqsin [reason: cr3-eqsin upgrade, T364092] [12:34:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76020 and previous config saved to /var/cache/conftool/dbconfig/20250513-123404-root.json [12:34:19] (03PS1) 10Marostegui: instances.yaml: Add db1256 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1145185 (https://phabricator.wikimedia.org/T390530) [12:36:04] !log ayounsi@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr3-eqsin with reason: upgrade [12:36:12] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10816116 (10ops-monitoring-bot) 
Icinga downtime and Alertmanager silence (ID=a82fd52b-494d-4956-9f75-7cd844fe0007) set by ayounsi@cumin1002 for 2:00:00 on 1 host(s) and their servic... [12:36:25] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db1256 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1145185 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [12:36:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76021 and previous config saved to /var/cache/conftool/dbconfig/20250513-123631-root.json [12:36:52] (03CR) 10Filippo Giunchedi: [C:03+1] eventstreams: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144470 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [12:37:17] (03CR) 10Filippo Giunchedi: [C:03+1] eventgate: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144469 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [12:38:45] (03CR) 10AOkoth: [C:03+2] deployment: add miscweb aux deploy user [puppet] - 10https://gerrit.wikimedia.org/r/1144578 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [12:39:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db1256 future x3 hosts, to s8 T390530', diff saved to https://phabricator.wikimedia.org/P76022 and previous config saved to /var/cache/conftool/dbconfig/20250513-123917-marostegui.json [12:39:21] T390530: Create topology for x3 hosts - https://phabricator.wikimedia.org/T390530 [12:39:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76023 and previous config saved to /var/cache/conftool/dbconfig/20250513-123925-root.json [12:39:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76024 and previous config saved to 
/var/cache/conftool/dbconfig/20250513-123926-root.json [12:39:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76025 and previous config saved to /var/cache/conftool/dbconfig/20250513-123935-root.json [12:40:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P76026 and previous config saved to /var/cache/conftool/dbconfig/20250513-124029-root.json [12:40:49] !log cr3-eqsin# set protocols bgp graceful-shutdown sender - T364092 [12:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:51] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [12:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [12:42:03] (03PS1) 10Volans: debdeploy: add support for Debian Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1145187 (https://phabricator.wikimedia.org/T391083) [12:43:05] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1145187 (https://phabricator.wikimedia.org/T391083) (owner: 10Volans) [12:44:50] (03CR) 10CI reject: [V:04-1] debdeploy: add support for Debian Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1145187 (https://phabricator.wikimedia.org/T391083) (owner: 10Volans) [12:45:35] (03PS2) 10Volans: debdeploy: add support for Debian Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1145187 (https://phabricator.wikimedia.org/T391083) [12:46:47] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [12:46:56] FIRING: [2x] SystemdUnitFailed: 
curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:47:03] (03CR) 10Muehlenhoff: partman: Add a kubernetes-node-containerd-efi recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143817 (https://phabricator.wikimedia.org/T393053) (owner: 10Alexandros Kosiaris) [12:47:42] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [12:48:27] (03CR) 10Muehlenhoff: "Once https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143817 is merged, maybe rather use it instead? We're aligning to use UEFI where" [puppet] - 10https://gerrit.wikimedia.org/r/1145086 (https://phabricator.wikimedia.org/T393948) (owner: 10Klausman) [12:48:36] (03PS1) 10Majavah: hieradata: Update striker-toolsbeta to 2025-05-13-124445-production [puppet] - 10https://gerrit.wikimedia.org/r/1145189 [12:49:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76027 and previous config saved to /var/cache/conftool/dbconfig/20250513-124910-root.json [12:50:24] (03CR) 10Kamila Součková: [C:03+1] Revert^2 "mw::maintenance: migrate mediamoderation-hourlyScan to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1144582 (https://phabricator.wikimedia.org/T393236) (owner: 10Dreamy Jazz) [12:50:51] !log trigger full planet import for Bookworm maps master T381565 [12:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:54] T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 [12:51:22] (03CR) 10Majavah: [C:03+2] hieradata: Update striker-toolsbeta to 2025-05-13-124445-production [puppet] - 10https://gerrit.wikimedia.org/r/1145189 (owner: 10Majavah) [12:51:25] (03CR) 10Filippo Giunchedi: [C:03+1] echoserver: 
upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144468 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [12:52:38] (03CR) 10Filippo Giunchedi: [C:03+1] datasets-config: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144466 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [12:53:39] (03CR) 10Volans: [C:03+2] debdeploy: add support for Debian Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1145187 (https://phabricator.wikimedia.org/T391083) (owner: 10Volans) [12:53:39] !log cr3-eqsin - lower vrrp priority - T364092 [12:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:42] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [12:54:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76028 and previous config saved to /var/cache/conftool/dbconfig/20250513-125430-root.json [12:54:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76029 and previous config saved to /var/cache/conftool/dbconfig/20250513-125431-root.json [12:54:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P76030 and previous config saved to /var/cache/conftool/dbconfig/20250513-125440-root.json [12:55:02] (03CR) 10Filippo Giunchedi: [C:03+1] datahub: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144465 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [12:55:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P76031 and previous config saved to /var/cache/conftool/dbconfig/20250513-125535-root.json [12:56:54] 
(03PS2) 10JMeybohm: Refactor sre.discovery's use of resolve_with_client_ip [cookbooks] - 10https://gerrit.wikimedia.org/r/1145129 (https://phabricator.wikimedia.org/T393600) [12:56:54] (03PS2) 10JMeybohm: sre.discovery.datacenter: Raise CookbookInitSuccess on status action [cookbooks] - 10https://gerrit.wikimedia.org/r/1145137 [12:57:05] (03CR) 10JMeybohm: "Yes I had - but that might not have been enough." [cookbooks] - 10https://gerrit.wikimedia.org/r/1145129 (https://phabricator.wikimedia.org/T393600) (owner: 10JMeybohm) [12:57:29] (03CR) 10JMeybohm: "Thanks for putting together that mail summary 😊" [cookbooks] - 10https://gerrit.wikimedia.org/r/1145137 (owner: 10JMeybohm) [12:57:40] !log cr3-eqsin - shutdown transit/peering BGP sessions - T364092 [12:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:51] !log upgrading python3-wmflib fleetwide (except eqsin for now) [12:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window / Backport Party! Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T1300). [13:00:05] pfischer and MatmaRex: A patch you scheduled for UTC afternoon backport window / Backport Party! Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] Hey folks, sorry for the spam, do we have any updates on the CI issue ?: [13:00:07] `rake aborted!` [13:00:07] NoMethodError: undefined method `filter!' for true:TrueClass
[13:00:30] !log cr3-eqsin> request vmhost software add /var/tmp/junos-vmhost-install-mx-x86-64-23.4R2-S3.9.tgz - T364092 [13:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:32] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [13:01:16] hi there! remember you can now use the new amazing flying spider pig for your backports! https://spiderpig.wikimedia.org/ [13:01:28] (03Abandoned) 10Elukey: istio: introduce legacy images to backport features [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1144612 (https://phabricator.wikimedia.org/T392886) (owner: 10Elukey) [13:01:52] hi [13:02:30] o/ [13:02:41] I can deploy! [13:02:59] o/ [13:03:04] (03CR) 10Klausman: [V:03+1 C:03+2] "Oh, I was not aware of that change. Good call!" [puppet] - 10https://gerrit.wikimedia.org/r/1145086 (https://phabricator.wikimedia.org/T393948) (owner: 10Klausman) [13:04:12] pfischer: is this the change you accidentally merged yesterday or am I imagining things?
^^ [13:04:45] (03PS2) 10Lucas Werkmeister (WMDE): CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144600 (https://phabricator.wikimedia.org/T389053) (owner: 10Peter Fischer) [13:04:55] Lucas_WMDE: yes [13:04:56] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "(added trailing commas so the next diff will still look nice)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144600 (https://phabricator.wikimedia.org/T389053) (owner: 10Peter Fischer) [13:04:59] ok ^^ [13:05:19] let’s see if PS2 yields a proper diffConfig [13:05:42] (03PS1) 10AOkoth: wmnet: create os-reports record [dns] - 10https://gerrit.wikimedia.org/r/1145191 (https://phabricator.wikimedia.org/T350794) [13:05:42] it does, yay [13:05:53] (on PS1 it was broken due to unrelated errors at the time) [13:06:12] Oh, okay [13:06:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144600 (https://phabricator.wikimedia.org/T389053) (owner: 10Peter Fischer) [13:06:43] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10816238 (10ssingh) Thanks @RobH! @cmooney: yeah, I updated the task description to reflect that but we thought we should get this checked out anyway, since it's the integrated NIC.... [13:07:00] jnuche: are we supposed to see a notification when we’re already on the spiderpig tab? or is it only when the tab is in the background?
[13:07:05] (03PS1) 10AOkoth: trafficserver: update os-reports replacment url [puppet] - 10https://gerrit.wikimedia.org/r/1145192 (https://phabricator.wikimedia.org/T350794) [13:07:09] (03Merged) 10jenkins-bot: CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144600 (https://phabricator.wikimedia.org/T389053) (owner: 10Peter Fischer) [13:07:21] right now I didn’t get one for the “is this the change you want to backport or not” (before it votes CR+2) [13:07:34] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1144600|CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) (T389053)]] [13:07:38] T389053: Rename weighted_tags referencing ores in their names - https://phabricator.wikimedia.org/T389053 [13:07:40] (03PS1) 10Bartosz Dziewoński: SUL3: Fix account creation by username & email (with temp password) [extensions/CentralAuth] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145193 (https://phabricator.wikimedia.org/T390751) [13:07:51] (03PS2) 10AOkoth: trafficserver: update os-reports replacment url [puppet] - 10https://gerrit.wikimedia.org/r/1145192 (https://phabricator.wikimedia.org/T350794) [13:07:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145193 (https://phabricator.wikimedia.org/T390751) (owner: 10Bartosz Dziewoński) [13:08:14] Lucas_WMDE: yeah, that prompt is an exception, you won't get a notification for it because the prompt happens right after you click "Start Backport" and it didn't feel like useful feedback [13:08:44] for every other prompt, you should get the notification, even if the window/tab is not in the foreground [13:08:56] I see, fair enough [13:09:12] (03PS3) 10Federico Ceratto: hiera:
Add zarcillo k8s service on traffic server [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) [13:09:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76032 and previous config saved to /var/cache/conftool/dbconfig/20250513-130935-root.json [13:09:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76033 and previous config saved to /var/cache/conftool/dbconfig/20250513-130937-root.json [13:09:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P76034 and previous config saved to /var/cache/conftool/dbconfig/20250513-130945-root.json [13:10:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76035 and previous config saved to /var/cache/conftool/dbconfig/20250513-131040-root.json [13:13:12] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1145096 (https://phabricator.wikimedia.org/T390751) (owner: 10D3r1ck01) [13:13:18] MatmaRex: ^ fyi [13:13:44] thanks [13:13:58] Lucas_WMDE: i also just added a wmf.1 backport of the same change, missed it earlier [13:14:06] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, pfischer: Backport for [[gerrit:1144600|CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) (T389053)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:14:09] T389053: Rename weighted_tags referencing ores in their names - https://phabricator.wikimedia.org/T389053 [13:14:41] (03Merged) 10jenkins-bot: SUL3: Fix account creation by username & email (with temp password) 
[extensions/CentralAuth] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1145096 (https://phabricator.wikimedia.org/T390751) (owner: 10D3r1ck01) [13:14:47] (03CR) 10Jelto: trafficserver: update os-reports replacment url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1145192 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [13:14:50] whoa, that merge was fast [13:14:57] anyway, pfischer, please test on WikimediaDebug :) [13:15:08] !log cr3-eqsin> request vmhost reboot - T364092 [13:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:11] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [13:15:37] Lucas_WMDE: cannot test that config change, it has no effect, only on a maintenance script. [13:16:08] ok [13:16:14] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, pfischer: Continuing with sync [13:17:13] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145193 (https://phabricator.wikimedia.org/T390751) (owner: 10Bartosz Dziewoński) [13:17:20] jnuche: fyi T394033 [13:17:20] T394033: SpiderPig sometimes misses notifications - https://phabricator.wikimedia.org/T394033 [13:17:31] I’ll check again during MatmaRex’ backports so I can be absolutely sure that I didn’t reload the tab ^^ [13:17:32] (03PS1) 10Cathal Mooney: Network: add puppet data for new devices and networks codfw expansion [puppet] - 10https://gerrit.wikimedia.org/r/1145194 (https://phabricator.wikimedia.org/T394021) [13:18:03] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 79, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:18:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:18:11] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:18:11] Lucas_WMDE: ack, thanks for filing. Reloading the tab shouldn't affect the behavior anyway [13:18:31] oh, I thought you mentioned that yesterday [13:18:35] ok [13:18:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-codfw and cr3-eqsin (103.102.166.131) - group Confed_eqsin - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:18:40] (03PS2) 10Alexandros Kosiaris: function-orchestrator: Bump CPU requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144920 (https://phabricator.wikimedia.org/T389375) [13:18:40] (03PS1) 10Alexandros Kosiaris: CI: Sleep 500ms to allow multithreaded fixture population to work [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145195 [13:18:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:et-0/0/0 (Core: cr3-eqsin:et-0/0/0 {#1116}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:19:03] expected ^ [13:19:03] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10816308 (10ssingh) a:05RobH→03BCornwall [13:19:05] I believe above alerts for BGP/OSPF are due to upgrades Ar zhe l is doing on eqsin core routers [13:19:07] ah, you said “clear the browser’s state”, I guess I misunderstood what it refers to ^^ [13:19:15] (03Merged) 10jenkins-bot: SUL3: Fix account creation by username & email (with temp password) [extensions/CentralAuth] (wmf/1.45.0-wmf.1) - 
10https://gerrit.wikimedia.org/r/1145193 (https://phabricator.wikimedia.org/T390751) (owner: 10Bartosz Dziewoński) [13:21:03] (03CR) 10AOkoth: trafficserver: update os-reports replacment url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1145192 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [13:22:39] Lucas_WMDE: the notifications code stores state in the browser's local storage, if you wipe out that (e.g. often done by the browser when you tell it to clear everything, including cookies, etc) the notifications will stop working. But just reloading the tab should still work [13:22:48] ok [13:22:54] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1144600|CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) (T389053)]] (duration: 15m 19s) [13:22:57] T389053: Rename weighted_tags referencing ores in their names - https://phabricator.wikimedia.org/T389053 [13:23:23] (03PS1) 10Filippo Giunchedi: netops: test with peer_descr with spaces [alerts] - 10https://gerrit.wikimedia.org/r/1145196 [13:23:23] 06SRE, 10Observability-Metrics: Set a predefined time window in Pyrra's configuration to measure SLOs with - https://phabricator.wikimedia.org/T393796#10816324 (10herron) >! In T393796#10815188, @elukey wrote: > @herron I think it is perfect, two manual changes are definitely ok for this use case! Do you know... 
[13:23:47] MatmaRex: starting your backports now [13:23:49] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1145096|SUL3: Fix account creation by username & email (with temp password) (T390751)]], [[gerrit:1145193|SUL3: Fix account creation by username & email (with temp password) (T390751)]] [13:24:12] T390751: SUL3 broke the ability to send new user's password via email - https://phabricator.wikimedia.org/T390751 [13:24:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 4.241% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:24:29] yup [13:24:39] (03CR) 10Filippo Giunchedi: [C:03+1] chromium-renderer: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144462 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:24:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76036 and previous config saved to /var/cache/conftool/dbconfig/20250513-132441-root.json [13:24:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76037 and previous config saved to /var/cache/conftool/dbconfig/20250513-132442-root.json [13:24:45] (03CR) 10Filippo Giunchedi: [C:03+1] calculator-service: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144459 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:24:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P76038 and previous config saved to 
/var/cache/conftool/dbconfig/20250513-132451-root.json [13:25:05] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 82, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:25:11] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:25:27] (03CR) 10Filippo Giunchedi: [C:03+1] blunderbuss: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144458 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:25:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76039 and previous config saved to /var/cache/conftool/dbconfig/20250513-132545-root.json [13:26:07] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:26:56] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:28:15] FIRING: [9x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:28:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.026s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:28:39] (03CR) 10Elukey: [C:03+2] api-gateway: upgrade to mesh:configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144456 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:28:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-codfw and cr3-eqsin (103.102.166.131) - group Confed_eqsin - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:28:49] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 420, down: 7, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:28:51] RESOLVED: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:et-0/0/0 (Core: cr3-eqsin:et-0/0/0 {#1116}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:29:01] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:29:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 7.595% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:29:29] (03CR) 10Elukey: [C:03+2] aqs-http-gateway: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144457 
(https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:29:39] (03CR) 10Elukey: [C:03+2] blunderbuss: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144458 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:29:42] 09:29:01 <+jinxer-wm> FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - [13:29:47] (03CR) 10Elukey: [C:03+2] calculator-service: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144459 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:29:50] that's the eqiad v6? [13:30:00] (03CR) 10CI reject: [V:04-1] function-orchestrator: Bump CPU requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144920 (https://phabricator.wikimedia.org/T389375) (owner: 10Alexandros Kosiaris) [13:30:20] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex, d3r1ck01: Backport for [[gerrit:1145096|SUL3: Fix account creation by username & email (with temp password) (T390751)]], [[gerrit:1145193|SUL3: Fix account creation by username & email (with temp password) (T390751)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:30:24] T390751: SUL3 broke the ability to send new user's password via email - https://phabricator.wikimedia.org/T390751 [13:31:18] Lucas_WMDE: works as expected! 
[13:31:23] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex, d3r1ck01: Continuing with sync [13:31:56] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:33:15] RESOLVED: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:33:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.538s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:34:51] (03PS1) 10Kamila Součková: mw-cron/UpdatePeriodicMetrics-per-wiki: really fix dblist [puppet] - 10https://gerrit.wikimedia.org/r/1145197 (https://phabricator.wikimedia.org/T388542) [13:35:30] MatmaRex: thanks for testing! 
[13:35:37] sorry, was distracted by SpiderPig notification debugging ^^ [13:37:43] (03CR) 10CI reject: [V:04-1] mw-cron/UpdatePeriodicMetrics-per-wiki: really fix dblist [puppet] - 10https://gerrit.wikimedia.org/r/1145197 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [13:37:57] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1145096|SUL3: Fix account creation by username & email (with temp password) (T390751)]], [[gerrit:1145193|SUL3: Fix account creation by username & email (with temp password) (T390751)]] (duration: 14m 07s) [13:38:00] T390751: SUL3 broke the ability to send new user's password via email - https://phabricator.wikimedia.org/T390751 [13:38:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.538s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:38:41] (03CR) 10Filippo Giunchedi: [C:03+2] netops: test with peer_descr with spaces [alerts] - 10https://gerrit.wikimedia.org/r/1145196 (owner: 10Filippo Giunchedi) [13:39:06] (03PS2) 10Kamila Součková: mw-cron/UpdatePeriodicMetrics-per-wiki: really fix dblist [puppet] - 10https://gerrit.wikimedia.org/r/1145197 (https://phabricator.wikimedia.org/T388542) [13:39:39] Lucas_WMDE: all done? 
thank you [13:39:43] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145197 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [13:39:45] should be, yeah [13:39:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76041 and previous config saved to /var/cache/conftool/dbconfig/20250513-133946-root.json [13:39:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76042 and previous config saved to /var/cache/conftool/dbconfig/20250513-133947-root.json [13:39:52] I’m trying to redeploy it now just to test T394033 [13:39:52] T394033: SpiderPig sometimes misses notifications - https://phabricator.wikimedia.org/T394033 [13:39:54] but that shouldn’t affect you [13:39:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76043 and previous config saved to /var/cache/conftool/dbconfig/20250513-133956-root.json [13:39:59] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1145096|SUL3: Fix account creation by username & email (with temp password) (T390751)]] [13:40:04] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10816412 (10ayounsi) [13:40:06] cool. 
thanks [13:40:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76044 and previous config saved to /var/cache/conftool/dbconfig/20250513-134051-root.json [13:41:13] (03PS1) 10Eevans: cassandra-jbod.cfg preseed: grow volume to fill space [puppet] - 10https://gerrit.wikimedia.org/r/1145198 (https://phabricator.wikimedia.org/T391544) [13:41:25] (03PS1) 10Ayounsi: Enable PfwCoreBGPDown in codfw [alerts] - 10https://gerrit.wikimedia.org/r/1145199 (https://phabricator.wikimedia.org/T388641) [13:46:12] !log lucaswerkmeister-wmde@deploy1003 d3r1ck01, lucaswerkmeister-wmde: Backport for [[gerrit:1145096|SUL3: Fix account creation by username & email (with temp password) (T390751)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:46:16] T390751: SUL3 broke the ability to send new user's password via email - https://phabricator.wikimedia.org/T390751 [13:46:28] (03PS1) 10Brouberol: airflow: prevent resource name collisions when multiple releases are installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145200 (https://phabricator.wikimedia.org/T393999) [13:46:42] (03CR) 10Cathal Mooney: [C:03+1] "I'll just say +1 and not ask why :P" [alerts] - 10https://gerrit.wikimedia.org/r/1145199 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:47:04] (03CR) 10Clément Goubert: [C:03+1] mw-cron/UpdatePeriodicMetrics-per-wiki: really fix dblist [puppet] - 10https://gerrit.wikimedia.org/r/1145197 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [13:47:06] !log ayounsi@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqsin [reason: cr3-eqsin upgrade finished, T364092] [13:47:09] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [13:47:09] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqsin [reason: 
cr3-eqsin upgrade finished, T364092] [13:47:21] (03CR) 10CI reject: [V:04-1] airflow: prevent resource name collisions when multiple releases are installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145200 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:48:16] (03CR) 10Filippo Giunchedi: [C:03+1] Enable PfwCoreBGPDown in codfw [alerts] - 10https://gerrit.wikimedia.org/r/1145199 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:48:17] (03CR) 10Ayounsi: [C:03+2] Enable PfwCoreBGPDown in codfw [alerts] - 10https://gerrit.wikimedia.org/r/1145199 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:48:28] (03PS1) 10Btullis: Update the managed airflow temp directory for dumps on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1145201 (https://phabricator.wikimedia.org/T389784) [13:49:17] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5536/co" [puppet] - 10https://gerrit.wikimedia.org/r/1145201 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [13:49:29] jouncebot: nowandnext [13:49:29] For the next 0 hour(s) and 10 minute(s): UTC afternoon backport window / Backport Party!Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T1300) [13:49:29] In 1 hour(s) and 10 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T1500) [13:49:29] (03Merged) 10jenkins-bot: Enable PfwCoreBGPDown in codfw [alerts] - 10https://gerrit.wikimedia.org/r/1145199 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:49:29] (03CR) 10Ayounsi: "With I4b9c7b856d05fb11ba7c5a364b22bf33a806e644 there are no blockers anymore." 
[puppet] - 10https://gerrit.wikimedia.org/r/1144650 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:49:57] (03CR) 10Kamila Součková: [C:03+2] mw-cron/UpdatePeriodicMetrics-per-wiki: really fix dblist [puppet] - 10https://gerrit.wikimedia.org/r/1145197 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [13:50:45] !log lucaswerkmeister-wmde@deploy1003 Sync cancelled. [13:51:01] (T394033 debugging done, didn’t need the sync to finish) [13:51:01] T394033: SpiderPig sometimes misses notifications - https://phabricator.wikimedia.org/T394033 [13:51:18] (03PS1) 10Ssingh: sre.dns.admin: improve print summary formatting [cookbooks] - 10https://gerrit.wikimedia.org/r/1145203 [13:51:39] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145200 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:51:50] !log UTC afternoon backport+config window done [13:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:54] * Lucas_WMDE done debugging [13:51:57] *deploying :D [13:52:02] (but debugging too :D) [13:52:24] (03CR) 10Filippo Giunchedi: [C:03+1] Icinga: remove some network devices checks [puppet] - 10https://gerrit.wikimedia.org/r/1144650 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:52:27] Raine: all clear as far as I’m concerned ^^ [13:52:47] thanks Lucas_WMDE <3 [13:53:13] (03CR) 10Muehlenhoff: [C:03+2] puppet: On Trixie install Puppet 7 from component/puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/1143746 (https://phabricator.wikimedia.org/T392790) (owner: 10Muehlenhoff) [13:53:34] (03CR) 10Hnowlan: [C:03+1] changeprop: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144460 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:53:57] (03CR) 10Xcollazo: [C:03+1] Update the managed airflow temp directory for dumps on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1145201 
(https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [13:54:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76045 and previous config saved to /var/cache/conftool/dbconfig/20250513-135451-root.json [13:54:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76046 and previous config saved to /var/cache/conftool/dbconfig/20250513-135452-root.json [13:55:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76047 and previous config saved to /var/cache/conftool/dbconfig/20250513-135502-root.json [13:55:22] (03PS2) 10Hnowlan: trafficserver: enwiki regex for restbaseless routing: all pages [puppet] - 10https://gerrit.wikimedia.org/r/1145125 (https://phabricator.wikimedia.org/T393591) [13:55:24] (03CR) 10Filippo Giunchedi: [C:03+1] function-orchestrator: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144473 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:55:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76048 and previous config saved to /var/cache/conftool/dbconfig/20250513-135557-root.json [13:57:23] (03CR) 10Ayounsi: [C:03+1] "one nit but lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1145194 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [13:58:02] (03CR) 10Filippo Giunchedi: [C:03+1] flink-app: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144471 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:58:07] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [13:58:32] (03CR) 10Hnowlan: [C:03+2] trafficserver: enwiki regex for restbaseless routing: all pages [puppet] - 10https://gerrit.wikimedia.org/r/1145125 (https://phabricator.wikimedia.org/T393591) (owner: 10Hnowlan) [13:59:59] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:00:22] (03CR) 10Filippo Giunchedi: "Why not! Worth a try, adding Timo" [puppet] - 10https://gerrit.wikimedia.org/r/1144662 (https://phabricator.wikimedia.org/T391677) (owner: 10Cwhite) [14:00:36] !log finalising rollout of restbaseless enwiki PCS APIs routed via rest-gateway [14:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:55] (03CR) 10Filippo Giunchedi: [C:03+1] function-evaluator: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144472 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:01:37] (03CR) 10Bking: [C:03+1] Fix non-existent team data-platform-sre [puppet] - 10https://gerrit.wikimedia.org/r/1145138 (https://phabricator.wikimedia.org/T393858) (owner: 10Filippo Giunchedi) [14:02:58] (03CR) 10Filippo Giunchedi: [C:03+2] Fix non-existent team data-platform-sre [puppet] - 10https://gerrit.wikimedia.org/r/1145138 (https://phabricator.wikimedia.org/T393858) (owner: 10Filippo Giunchedi) [14:03:54] (03CR) 10Filippo Giunchedi: [C:03+1] developer-portal: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144467 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:05:25] (03CR) 
10Filippo Giunchedi: [C:03+1] cxserver: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144464 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:06:34] (03CR) 10Brouberol: [C:03+1] Update the managed airflow temp directory for dumps on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1145201 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [14:07:00] (03CR) 10Filippo Giunchedi: [C:03+1] citoid: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144463 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:07:30] (03CR) 10Btullis: [V:03+1 C:03+2] Update the managed airflow temp directory for dumps on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1145201 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [14:08:52] (03CR) 10Giuseppe Lavagetto: [C:03+2] "Oh lord please forgive us for this sin." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145195 (owner: 10Alexandros Kosiaris) [14:09:22] (03CR) 10Filippo Giunchedi: [C:03+1] chart-renderer: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144461 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:09:30] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [14:09:34] (03CR) 10Alexandros Kosiaris: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144920 (https://phabricator.wikimedia.org/T389375) (owner: 10Alexandros Kosiaris) [14:09:52] (03Abandoned) 10Ssingh: P:cache::varnish::frontend: reload vcl in beta [puppet] - 10https://gerrit.wikimedia.org/r/1007953 (https://phabricator.wikimedia.org/T358887) (owner: 10Ssingh) [14:09:56] (03CR) 10Jelto: [C:03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1145191 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [14:09:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 50%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P76049 and previous config saved to /var/cache/conftool/dbconfig/20250513-140956-root.json [14:09:57] (03CR) 10DCausse: [C:03+2] team-search-platform: Add alert for wdqs-categories lag [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking) [14:09:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76050 and previous config saved to /var/cache/conftool/dbconfig/20250513-140958-root.json [14:10:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76051 and previous config saved to /var/cache/conftool/dbconfig/20250513-141007-root.json [14:10:15] (03CR) 10Ssingh: "Let's update this and roll it out?" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [14:10:20] (03CR) 10Filippo Giunchedi: [C:03+1] changeprop: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144460 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:11:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76052 and previous config saved to /var/cache/conftool/dbconfig/20250513-141102-root.json [14:11:10] (03Merged) 10jenkins-bot: team-search-platform: Add alert for wdqs-categories lag [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking) [14:11:31] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1071 to cirrussearch1071 [14:12:04] (03CR) 10Ssingh: "(note: templates/wikimedia.org:os-reports 1D IN CNAME dyna.wikimedia.org.)" [dns] - 10https://gerrit.wikimedia.org/r/1145191 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [14:14:56] !log bking@cumin2002 START - Cookbook sre.dns.netbox 
[14:15:02] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for pc1018 db1258 - jclark@cumin1002" [14:15:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for pc1018 db1258 - jclark@cumin1002" [14:15:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:17:17] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host db1258.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:17:21] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host pc1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:18:33] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:19:02] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host pc1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:20:08] (03PS1) 10Muehlenhoff: Install the Puppet 7 agent in d-i for trixie as well [puppet] - 10https://gerrit.wikimedia.org/r/1145209 (https://phabricator.wikimedia.org/T392790) [14:20:28] (03CR) 10Kamila Součková: [C:03+1] "LGTM with a sprinkle of inline paranoia" [puppet] - 10https://gerrit.wikimedia.org/r/1143198 (https://phabricator.wikimedia.org/T385866) (owner: 10Scott French) [14:20:32] bking@cumin2002 rename (PID 2437033) is awaiting input [14:21:43] (03CR) 10Kamila Součková: [C:03+1] P:mw::maint::purge_expired_userrights: purge_expired_global_rights to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143199 (https://phabricator.wikimedia.org/T385866) (owner: 10Scott French) [14:22:32] 06SRE, 10Observability-Metrics: Every Grafana dashboard 
generated by Pyrra contains two panels displaying misleading data - https://phabricator.wikimedia.org/T393797#10816647 (10herron) Yeah, those panels are spiky and confusing. To try and make the better use of the recording rules that are in place today I... [14:23:02] (03CR) 10Kamila Součková: [C:03+1] P:mw::maintenance::refreshlinks: migrate remaining shards to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1144637 (https://phabricator.wikimedia.org/T388530) (owner: 10Scott French) [14:23:47] (03CR) 10Elukey: [C:03+2] changeprop: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144460 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:23:55] (03CR) 10Elukey: [C:03+2] chart-renderer: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144461 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:24:03] (03CR) 10Elukey: [C:03+2] chromium-renderer: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144462 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:24:10] (03CR) 10Elukey: [C:03+2] citoid: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144463 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:24:18] (03CR) 10Elukey: [C:03+2] cxserver: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144464 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:24:23] (03PS2) 10Muehlenhoff: Install the Puppet 7 agent in d-i for trixie as well [puppet] - 10https://gerrit.wikimedia.org/r/1145209 (https://phabricator.wikimedia.org/T392790) [14:24:39] (03CR) 10Elukey: [C:03+2] datahub: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144465 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:24:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: 
Q4:rack/setup/install pc1018 - https://phabricator.wikimedia.org/T392492#10816669 (10Jclark-ctr) [14:24:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10816670 (10Jclark-ctr) [14:24:52] (03CR) 10Elukey: [C:03+2] datasets-config: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144466 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:24:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10816677 (10Jclark-ctr) a:03Jclark-ctr [14:25:02] (03CR) 10Elukey: [C:03+2] developer-portal: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144467 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:25:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76053 and previous config saved to /var/cache/conftool/dbconfig/20250513-142501-root.json [14:25:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76054 and previous config saved to /var/cache/conftool/dbconfig/20250513-142503-root.json [14:25:06] (03CR) 10Volans: [C:03+1] "Sounds good :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1145203 (owner: 10Ssingh) [14:25:10] (03CR) 10Elukey: [C:03+2] echoserver: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144468 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:25:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76055 and previous config saved to /var/cache/conftool/dbconfig/20250513-142513-root.json [14:25:20] (03CR) 10Elukey: [C:03+2] eventgate: upgrade to mesh.configuration 1.13 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1144469 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:25:26] (03CR) 10Elukey: [C:03+2] eventstreams: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144470 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:25:33] (03CR) 10Elukey: [C:03+2] flink-app: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144471 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:25:43] (03CR) 10Elukey: [C:03+2] function-evaluator: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144472 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:25:49] (03CR) 10Elukey: [C:03+2] function-orchestrator: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144473 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:26:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76056 and previous config saved to /var/cache/conftool/dbconfig/20250513-142608-root.json [14:26:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in codfw - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [14:27:50] (03CR) 10Ssingh: [C:03+2] sre.dns.admin: improve print summary formatting [cookbooks] - 10https://gerrit.wikimedia.org/r/1145203 (owner: 10Ssingh) [14:28:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:28:44] (03PS2) 10DCausse: [WIP] changeprop: drop CirrusSearch changeprop 
settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035732 [14:28:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:29:18] (03PS3) 10DCausse: changeprop: drop CirrusSearch changeprop settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035732 [14:29:22] (03Merged) 10jenkins-bot: CI: Sleep 500ms to allow multithreaded fixture population to work [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145195 (owner: 10Alexandros Kosiaris) [14:29:30] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1071 to cirrussearch1071 - bking@cumin2002" [14:29:36] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1071 to cirrussearch1071 - bking@cumin2002" [14:29:36] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:29:36] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1071 on all recursors [14:29:39] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1071 on all recursors [14:29:40] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1071 [14:30:15] (03CR) 10CI reject: [V:04-1] changeprop: drop CirrusSearch changeprop settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035732 (owner: 10DCausse) [14:30:57] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1071 [14:31:05] (03PS1) 10Elukey: ipoid: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145214 (https://phabricator.wikimedia.org/T391333) [14:31:06] (03PS1) 10Elukey: kartotherian: upgrade to mesh.configuration 1.13 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1145215 (https://phabricator.wikimedia.org/T391333) [14:31:06] (03PS1) 10Elukey: kask: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145216 (https://phabricator.wikimedia.org/T391333) [14:31:07] (03PS1) 10Elukey: linkrecommendation: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145217 (https://phabricator.wikimedia.org/T391333) [14:31:09] (03PS1) 10Elukey: machinetranslation: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145218 (https://phabricator.wikimedia.org/T391333) [14:31:10] (03PS1) 10Elukey: mathoid: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145219 (https://phabricator.wikimedia.org/T391333) [14:31:12] (03PS1) 10Elukey: mediawiki: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145220 (https://phabricator.wikimedia.org/T391333) [14:31:16] (03PS1) 10Elukey: mediawiki-dumps-legacy: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145221 (https://phabricator.wikimedia.org/T391333) [14:31:20] (03PS1) 10Elukey: miscweb: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145222 (https://phabricator.wikimedia.org/T391333) [14:31:24] (03PS1) 10Elukey: mobileapps: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145223 (https://phabricator.wikimedia.org/T391333) [14:31:28] (03PS1) 10Elukey: mpic: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145224 (https://phabricator.wikimedia.org/T391333) [14:31:32] (03PS1) 10Elukey: push-notifications: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145225 (https://phabricator.wikimedia.org/T391333) [14:31:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: 
elevated 5xx errors from mobileapps_cluster in codfw - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [14:31:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1071 to cirrussearch1071 [14:31:36] (03PS1) 10Elukey: python-webapp: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145226 (https://phabricator.wikimedia.org/T391333) [14:31:40] (03PS1) 10Elukey: recommendation-api: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145227 (https://phabricator.wikimedia.org/T391333) [14:31:44] (03PS1) 10Elukey: shellbox: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145228 (https://phabricator.wikimedia.org/T391333) [14:31:48] (03PS1) 10Elukey: spark-history: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145229 (https://phabricator.wikimedia.org/T391333) [14:31:52] (03PS1) 10Elukey: superset: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145230 (https://phabricator.wikimedia.org/T391333) [14:31:56] (03PS1) 10Elukey: tegola-vector-tiles: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145231 (https://phabricator.wikimedia.org/T391333) [14:32:00] (03PS1) 10Elukey: termbox: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145232 (https://phabricator.wikimedia.org/T391333) [14:32:05] (03PS1) 10Elukey: thumbor: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145233 (https://phabricator.wikimedia.org/T391333) [14:32:17] (03CR) 10Brouberol: "I've 
tentatively applied this on airflow-test-k8s, and it worked without issues." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145200 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [14:32:21] (03CR) 10CI reject: [V:04-1] ipoid: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145214 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:32:25] (03CR) 10CI reject: [V:04-1] kartotherian: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145215 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:32:29] (03CR) 10CI reject: [V:04-1] kask: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145216 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:32:30] !log on going maintenance on msw1-codfw [14:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:33] (03CR) 10CI reject: [V:04-1] linkrecommendation: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145217 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:32:37] (03CR) 10CI reject: [V:04-1] machinetranslation: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145218 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:32:41] (03CR) 10CI reject: [V:04-1] mathoid: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145219 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:32:49] oh noes there is still an issue with CI :( [14:32:50] (03CR) 10CI reject: [V:04-1] mediawiki: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145220 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:32:52] some spam [14:33:05] (03CR) 10CI reject: [V:04-1] mediawiki-dumps-legacy: upgrade to mesh.configuration 
1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145221 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:33:19] (03CR) 10CI reject: [V:04-1] miscweb: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145222 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:33:27] (03PS1) 10Jforrester: Register our magic vars, so the parser knows to ask us what their values are [extensions/WikiLambda] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1145234 (https://phabricator.wikimedia.org/T345477) [14:33:36] (03PS1) 10Jforrester: Register our magic vars, so the parser knows to ask us what their values are [extensions/WikiLambda] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145235 (https://phabricator.wikimedia.org/T345477) [14:33:46] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1072 to cirrussearch1072 [14:34:00] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:34:07] (03CR) 10CI reject: [V:04-1] mpic: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145224 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:34:14] (03CR) 10CI reject: [V:04-1] mobileapps: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145223 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:34:34] (03CR) 10CI reject: [V:04-1] push-notifications: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145225 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:35:05] (03CR) 10CI reject: [V:04-1] python-webapp: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145226 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:35:13] (03Merged) 10jenkins-bot: sre.dns.admin: improve print summary formatting [cookbooks] - 10https://gerrit.wikimedia.org/r/1145203 (owner: 10Ssingh) 
[14:35:45] (03CR) 10CI reject: [V:04-1] recommendation-api: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145227 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:35:45] 06SRE, 10Observability-Metrics: Set a predefined time window in Pyrra's configuration to measure SLOs with - https://phabricator.wikimedia.org/T393796#10816785 (10elukey) Perfect! Then we should probably try to fix https://github.com/pyrra-dev/pyrra/issues/952 sending a patch to upstream, just to be consistent. [14:36:27] (03CR) 10CI reject: [V:04-1] shellbox: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145228 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:37:10] (03CR) 10CI reject: [V:04-1] spark-history: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145229 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:37:15] (03PS1) 10MVernon: swift: split find_db_paths out into separate function (nfc) [cookbooks] - 10https://gerrit.wikimedia.org/r/1145236 [14:37:17] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1072 to cirrussearch1072 - bking@cumin2002" [14:37:36] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1072 to cirrussearch1072 - bking@cumin2002" [14:37:36] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:37:37] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1072 on all recursors [14:37:40] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1072 on all recursors [14:37:41] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1072 [14:37:52] !log 
brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:38:03] (03CR) 10CI reject: [V:04-1] superset: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145230 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:38:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:38:58] (03CR) 10CI reject: [V:04-1] tegola-vector-tiles: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145231 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:39:14] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1072 [14:39:52] (03CR) 10CI reject: [V:04-1] termbox: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145232 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:39:54] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1072 to cirrussearch1072 [14:39:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1258.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:40:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76057 and previous config saved to /var/cache/conftool/dbconfig/20250513-144007-root.json [14:40:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76058 and previous config saved to /var/cache/conftool/dbconfig/20250513-144008-root.json [14:40:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76059 and previous config saved to 
/var/cache/conftool/dbconfig/20250513-144019-root.json [14:40:27] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1258.eqiad.wmnet with OS bookworm [14:40:34] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10816814 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host db1258.eqiad.wmnet with OS bookworm [14:41:10] (03CR) 10CI reject: [V:04-1] thumbor: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145233 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:41:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76060 and previous config saved to /var/cache/conftool/dbconfig/20250513-144113-root.json [14:41:35] (03CR) 10Gkyziridis: [C:03+1] ml-inference-services: edit-check experimental prod deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [14:41:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:41:53] (03CR) 10Gkyziridis: [C:03+1] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [14:42:18] (03CR) 10Brouberol: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145200 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [14:43:03] (03PS2) 10Btullis: Bump nodemanager heap on the production Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1144520 (https://phabricator.wikimedia.org/T393695) [14:43:11] (03PS12) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 
10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) [14:43:29] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host pc1018.eqiad.wmnet with OS bookworm [14:43:34] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc1018 - https://phabricator.wikimedia.org/T392492#10816843 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host pc1018.eqiad.wmnet with OS bookworm [14:44:04] (03CR) 10Btullis: Bump nodemanager heap on the production Hadoop cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1144520 (https://phabricator.wikimedia.org/T393695) (owner: 10Btullis) [14:44:07] (03CR) 10Alexandros Kosiaris: [C:03+2] function-orchestrator: Bump CPU requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144920 (https://phabricator.wikimedia.org/T389375) (owner: 10Alexandros Kosiaris) [14:44:35] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5537/co" [puppet] - 10https://gerrit.wikimedia.org/r/1144520 (https://phabricator.wikimedia.org/T393695) (owner: 10Btullis) [14:44:46] (03CR) 10Krinkle: prometheus: add more recording rules around editResponseTime (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1144662 (https://phabricator.wikimedia.org/T391677) (owner: 10Cwhite) [14:44:49] (03CR) 10Btullis: Bump nodemanager heap on the production Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1144520 (https://phabricator.wikimedia.org/T393695) (owner: 10Btullis) [14:45:03] (03CR) 10CI reject: [V:04-1] function-orchestrator: Bump CPU requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144920 (https://phabricator.wikimedia.org/T389375) (owner: 10Alexandros Kosiaris) [14:45:18] (03CR) 10Jelto: trafficserver: update os-reports replacment url (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1145192 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [14:47:22] PROBLEM - Host ps1-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:22] PROBLEM - Host ps1-a4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:22] PROBLEM - Host ps1-a3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:22] PROBLEM - Host ps1-a6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:22] PROBLEM - Host ps1-a5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:24] PROBLEM - Host ps1-a7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:24] PROBLEM - Host ps1-b4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:24] PROBLEM - Host ps1-a8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:24] PROBLEM - Host ps1-b2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:24] PROBLEM - Host ps1-b1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:28] PROBLEM - Host ps1-d3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:28] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:28] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:28] PROBLEM - Host ps1-b6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:28] PROBLEM - Host ps1-b5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:39] oh dear [14:47:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10816859 (10Jclark-ctr) [14:47:58] (03CR) 10JHathaway: [C:03+1] Install the Puppet 7 agent in d-i for trixie as well [puppet] - 10https://gerrit.wikimedia.org/r/1145209 (https://phabricator.wikimedia.org/T392790) (owner: 10Muehlenhoff) [14:48:09] is that expected for the mgmt router work? 
cc papaul [14:48:22] volans: yes [14:48:46] wb icinga-wm [14:48:46] ack thx [14:49:00] (03CR) 10Brouberol: [C:03+1] Bump nodemanager heap on the production Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1144520 (https://phabricator.wikimedia.org/T393695) (owner: 10Btullis) [14:49:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [14:49:07] (03CR) 10AOkoth: "Will this create any issues?" [dns] - 10https://gerrit.wikimedia.org/r/1145191 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [14:49:51] (03CR) 10Ssingh: "My bad, I should have clarified it. No, it should not, but just be aware that you have another record under wikimedia.org." [dns] - 10https://gerrit.wikimedia.org/r/1145191 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [14:49:53] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1071.eqiad.wmnet with OS bullseye [14:49:58] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1071 [14:49:58] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1071 [14:51:03] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [14:52:35] (03CR) 10Volans: [C:03+1] "Thanks for taking care of this tech debt bit! 
LGTM but better test it ofc :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1145129 (https://phabricator.wikimedia.org/T393600) (owner: 10JMeybohm) [14:52:42] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move connections on ssw1-f1-codfw to match normal pattern - https://phabricator.wikimedia.org/T393936#10816875 (10Jhancock.wm) @cmooney got them swapped for you [14:54:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [14:54:27] (03PS1) 10AOkoth: add os-reports to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1145241 (https://phabricator.wikimedia.org/T350794) [14:54:30] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1072.eqiad.wmnet with OS bullseye [14:54:33] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1072 [14:54:33] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1072 [14:55:11] (03CR) 10Btullis: [C:03+2] Bump nodemanager heap on the production Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1144520 (https://phabricator.wikimedia.org/T393695) (owner: 10Btullis) [14:55:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76061 and previous config saved to /var/cache/conftool/dbconfig/20250513-145513-root.json [14:55:15] !log marostegui@cumin1002 dbctl commit (dc=all): 
'db2241 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76062 and previous config saved to /var/cache/conftool/dbconfig/20250513-145514-root.json [14:55:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76063 and previous config saved to /var/cache/conftool/dbconfig/20250513-145525-root.json [14:56:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76064 and previous config saved to /var/cache/conftool/dbconfig/20250513-145620-root.json [14:56:34] (03PS1) 10CDanis: wikifunctions: send traces to the collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145242 (https://phabricator.wikimedia.org/T390753) [14:57:26] (03CR) 10CI reject: [V:04-1] wikifunctions: send traces to the collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145242 (https://phabricator.wikimedia.org/T390753) (owner: 10CDanis) [14:57:33] (03CR) 10CI reject: [V:04-1] swift: split find_db_paths out into separate function (nfc) [cookbooks] - 10https://gerrit.wikimedia.org/r/1145236 (owner: 10MVernon) [14:58:03] (03PS1) 10Ladsgroup: Add x1 to DBRecordCache for dumps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145243 (https://phabricator.wikimedia.org/T393513) [14:58:53] (03CR) 10CI reject: [V:04-1] Add x1 to DBRecordCache for dumps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145243 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup) [14:59:25] !log maintenance complete on msw1-codfw [14:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:04] jelto, arnoldokoth, and mutante: OwO what's this, a deployment window?? SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T1500). 
nyaa~ [15:00:25] !log tchin@deploy1003 Started deploy [airflow-dags/analytics@0550b16]: Deploying airflow artifacts for T384962 [15:00:29] T384962: Implement alerting for wmf_content.mediawiki_content_history_v1 - https://phabricator.wikimedia.org/T384962 [15:02:25] !log tchin@deploy1003 Finished deploy [airflow-dags/analytics@0550b16]: Deploying airflow artifacts for T384962 (duration: 02m 22s) [15:02:32] (03CR) 10Filippo Giunchedi: [C:03+1] Install the Puppet 7 agent in d-i for trixie as well [puppet] - 10https://gerrit.wikimedia.org/r/1145209 (https://phabricator.wikimedia.org/T392790) (owner: 10Muehlenhoff) [15:02:42] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:15] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1071.eqiad.wmnet with reason: host reimage [15:04:34] (03CR) 10Krinkle: [C:03+1] P:mw::maintenance::refreshlinks: migrate remaining shards to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1144637 (https://phabricator.wikimedia.org/T388530) (owner: 10Scott French) [15:08:03] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1071.eqiad.wmnet with reason: host reimage [15:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:18] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1072.eqiad.wmnet with reason: host reimage [15:09:37] jclark@cumin1002 reimage (PID 1822584) is awaiting input [15:10:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 
(re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76066 and previous config saved to /var/cache/conftool/dbconfig/20250513-151031-root.json [15:11:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76067 and previous config saved to /var/cache/conftool/dbconfig/20250513-151125-root.json [15:13:09] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1072.eqiad.wmnet with reason: host reimage [15:13:12] jclark@cumin1002 reimage (PID 1822806) is awaiting input [15:13:35] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145214 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [15:13:54] (03PS1) 10Muehlenhoff: imposm-initial-impact: Follow 302 redirects when fetching the checksum [puppet] - 10https://gerrit.wikimedia.org/r/1145245 (https://phabricator.wikimedia.org/T381565) [15:13:55] (03CR) 10CDanis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145242 (https://phabricator.wikimedia.org/T390753) (owner: 10CDanis) [15:15:34] (03PS1) 10Cathal Mooney: Add new INCLUDE statements in 0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa zone [dns] - 10https://gerrit.wikimedia.org/r/1145246 (https://phabricator.wikimedia.org/T394021) [15:16:11] (03CR) 10CI reject: [V:04-1] Add new INCLUDE statements in 0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa zone [dns] - 10https://gerrit.wikimedia.org/r/1145246 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [15:16:50] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:16:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/WikiLambda] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145235 (https://phabricator.wikimedia.org/T345477) (owner: 10Jforrester) [15:17:06] 
(03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/WikiLambda] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1145234 (https://phabricator.wikimedia.org/T345477) (owner: 10Jforrester) [15:17:32] (03PS2) 10Muehlenhoff: imposm-initial-import: Follow 302 redirects when fetching the checksum [puppet] - 10https://gerrit.wikimedia.org/r/1145245 (https://phabricator.wikimedia.org/T381565) [15:17:36] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1143198 (https://phabricator.wikimedia.org/T385866) (owner: 10Scott French) [15:21:23] (03PS2) 10MVernon: swift: split find_db_paths out into separate function (nfc) [cookbooks] - 10https://gerrit.wikimedia.org/r/1145236 [15:22:23] cmooney@cumin1002 netbox (PID 1827503) is awaiting input [15:22:37] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: generate dns recrods for new codfw switches - cmooney@cumin1002" [15:22:57] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: generate dns recrods for new codfw switches - cmooney@cumin1002" [15:22:57] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:24:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [15:25:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 60%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P76068 and previous config saved to /var/cache/conftool/dbconfig/20250513-152536-root.json [15:25:39] (03PS1) 10JMeybohm: Revert "Update admin_ng fixtures to reflect puppet changes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145248 [15:26:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76069 and previous config saved to /var/cache/conftool/dbconfig/20250513-152631-root.json [15:27:25] (03CR) 10Cathal Mooney: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1145246 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [15:27:36] !log on going maintenance on msw1-eqiad [15:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:23] (03CR) 10CI reject: [V:04-1] swift: split find_db_paths out into separate function (nfc) [cookbooks] - 10https://gerrit.wikimedia.org/r/1145236 (owner: 10MVernon) [15:29:24] (03PS2) 10JMeybohm: Revert "Update admin_ng fixtures to reflect puppet changes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145248 [15:29:24] (03PS1) 10JMeybohm: Revert "CI: Sleep 500ms to allow multithreaded fixture population to work" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145249 [15:29:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [15:30:42] (03PS3) 10JMeybohm: Revert "Update admin_ng fixtures to reflect puppet changes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145248 [15:30:52] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for 
release 'main' . [15:31:00] (03Abandoned) 10JMeybohm: Revert "CI: Sleep 500ms to allow multithreaded fixture population to work" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145249 (owner: 10JMeybohm) [15:31:46] (03PS1) 10Aleksandar Mastilovic: Set the remaining Gobblin resources to "absent" [puppet] - 10https://gerrit.wikimedia.org/r/1145250 (https://phabricator.wikimedia.org/T390249) [15:32:00] (03CR) 10Ssingh: "Verified on netbox and trusting the script, Luke :)" [dns] - 10https://gerrit.wikimedia.org/r/1145246 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [15:32:03] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1071.eqiad.wmnet with OS bullseye [15:32:05] (03CR) 10Ssingh: [C:03+1] Add new INCLUDE statements in 0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa zone [dns] - 10https://gerrit.wikimedia.org/r/1145246 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [15:32:17] (03CR) 10Cathal Mooney: [C:03+2] Add new INCLUDE statements in 0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa zone [dns] - 10https://gerrit.wikimedia.org/r/1145246 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [15:32:40] (03CR) 10Aleksandar Mastilovic: "It is, I apologize for the confusion." [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [15:32:52] !log cmooney@dns2005 START - running authdns-update [15:33:07] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:33:52] !log cmooney@dns2005 END - running authdns-update [15:35:15] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1258.eqiad.wmnet with OS bookworm [15:35:23] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10817116 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host db1258.eqiad.wmnet with OS bookworm executed with errors: - db1258 (**FAIL**)... [15:35:36] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1258.eqiad.wmnet with OS bookworm [15:35:44] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10817117 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host db1258.eqiad.wmnet with OS bookworm [15:36:45] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1072.eqiad.wmnet with OS bullseye [15:39:12] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:39:57] (03CR) 10CI reject: [V:04-1] swift: split find_db_paths out into separate function (nfc) [cookbooks] - 10https://gerrit.wikimedia.org/r/1145236 (owner: 10MVernon) [15:40:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76070 and previous config saved to /var/cache/conftool/dbconfig/20250513-154041-root.json [15:42:05] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1258.eqiad.wmnet with OS bookworm [15:42:14] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db1258 - 
https://phabricator.wikimedia.org/T392493#10817137 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host db1258.eqiad.wmnet with OS bookworm executed with errors: - db1258 (**FAIL**)... [15:43:49] (03PS4) 10MVernon: swift: split find_db_paths out into separate function (nfc) [cookbooks] - 10https://gerrit.wikimedia.org/r/1145236 [15:46:12] FIRING: [2x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:50:13] FIRING: BFDdown: BFD session down between cr3-eqsin and fe80::6687:88ff:fef2:6d50 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:52:57] (03CR) 10Brouberol: [C:03+1] Set the remaining Gobblin resources to "absent" [puppet] - 10https://gerrit.wikimedia.org/r/1145250 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [15:52:59] (03CR) 10Brouberol: [C:03+2] Set the remaining Gobblin resources to "absent" [puppet] - 10https://gerrit.wikimedia.org/r/1145250 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [15:53:12] (03CR) 10JMeybohm: [C:03+2] Revert "Update admin_ng fixtures to reflect puppet changes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145248 (owner: 10JMeybohm) [15:54:29] !log mvernon@cumin1002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-commons-local-public.ad in eqiad [15:55:13] RESOLVED: BFDdown: BFD session down between cr3-eqsin and fe80::6687:88ff:fef2:6d50 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:55:46] (03CR) 10Kamila Součková: [C:03+1] P:mw::maint::purge_expired_userrights: purge_expired_userrights to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143198 (https://phabricator.wikimedia.org/T385866) (owner: 10Scott French) [15:55:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76071 and previous config saved to /var/cache/conftool/dbconfig/20250513-155547-root.json [15:56:07] (03PS7) 10Aleksandar Mastilovic: Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) [15:56:12] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:56:21] !log gnt-instance modify -B memory=10g testreduce1002.eqiad.wmnet - T393904 [15:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:23] T393904: Bump memory of testreduce1002 - https://phabricator.wikimedia.org/T393904 [15:57:09] (03CR) 10Kamila Součková: [C:03+1] P:mediawiki::php: add uuid extension for PHP 8.1+ [puppet] - 10https://gerrit.wikimedia.org/r/1139947 (https://phabricator.wikimedia.org/T373752) (owner: 10Scott French) [15:57:11] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-commons-local-public.ad in eqiad [15:57:20] (03PS3) 10Alexandros Kosiaris: function-orchestrator: Bump CPU requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144920 (https://phabricator.wikimedia.org/T389375) [15:57:28] !log 
cgoubert@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM testreduce1002.eqiad.wmnet [15:57:30] (03CR) 10CI reject: [V:04-1] function-orchestrator: Bump CPU requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144920 (https://phabricator.wikimedia.org/T389375) (owner: 10Alexandros Kosiaris) [15:57:49] (03CR) 10MVernon: "I tested that the remove-ghost-objects cookbook still works with:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1145236 (owner: 10MVernon) [15:58:27] (03CR) 10Aleksandar Mastilovic: "@brouberol@wikimedia.org OK I think now we're ready to merge this one too (after deploying the changes from that other MR you just merged)" [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [15:58:45] (03CR) 10Aleksandar Mastilovic: Remove support for systemd Gobblin (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [16:00:05] jhathaway and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T1600) [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:24] (03CR) 10Brouberol: [C:03+1] Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [16:00:29] (03CR) 10Brouberol: [C:03+2] Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [16:01:23] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testreduce1002.eqiad.wmnet [16:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:02:56] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:09:41] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host pc1018.eqiad.wmnet with OS bookworm [16:09:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc1018 - https://phabricator.wikimedia.org/T392492#10817326 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host pc1018.eqiad.wmnet with OS bookworm executed with errors: - pc1018 (**FAIL**)... 
[16:09:51] !log dancy@deploy1003 Installing scap version "4.165.0" for 2 host(s) [16:10:00] (03Merged) 10jenkins-bot: Revert "Update admin_ng fixtures to reflect puppet changes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145248 (owner: 10JMeybohm) [16:11:40] !log dancy@deploy1003 Installation of scap version "4.165.0" completed for 2 hosts [16:14:42] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:16:18] (03PS1) 10Alexandros Kosiaris: Revert "CI: Sleep 500ms to allow multithreaded fixture population to work" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145256 [16:19:25] (03Abandoned) 10Alexandros Kosiaris: Revert "CI: Sleep 500ms to allow multithreaded fixture population to work" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145256 (owner: 10Alexandros Kosiaris) [16:20:09] (03PS4) 10Alexandros Kosiaris: function-orchestrator: Bump CPU requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144920 (https://phabricator.wikimedia.org/T389375) [16:20:56] !log maintenance complete on msw1-eqiad [16:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:32] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_eqiad [16:21:36] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_eqiad [16:22:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:22:35] (03CR) 10Alexandros Kosiaris: [C:03+2] function-orchestrator: Bump CPU requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144920 
(https://phabricator.wikimedia.org/T389375) (owner: Alexandros Kosiaris)
[16:24:20] (Merged) jenkins-bot: function-orchestrator: Bump CPU requests/limits [deployment-charts] - https://gerrit.wikimedia.org/r/1144920 (https://phabricator.wikimedia.org/T389375) (owner: Alexandros Kosiaris)
[16:24:42] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:25:10] FIRING: BFDdown: BFD session down between cr3-eqsin and fe80::6687:88ff:fef2:6d50 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[16:26:54] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[16:27:20] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[16:27:49] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[16:28:05] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[16:28:09] !log maintenance complete on msw2-eqiad
[16:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:14] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[16:28:44] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[16:30:11] (PS4) DCausse: changeprop: drop CirrusSearch changeprop settings [deployment-charts] - https://gerrit.wikimedia.org/r/1035732
[16:32:24] !log dancy@deploy1003 Installing scap version "4.166.0" for 2 host(s)
[16:33:05] (PS5) DCausse: changeprop: drop CirrusSearch changeprop settings [deployment-charts] -
https://gerrit.wikimedia.org/r/1035732
[16:33:44] ops-codfw, SRE, DC-Ops, Traffic: hw troubleshooting: Memory failure for cp2029.codfw.wmnet - https://phabricator.wikimedia.org/T393968#10817471 (Jhancock.wm) a:Jhancock.wm idrac said DIMM_B3 was at fault. i swapped it with DIMM_A3. if reseating it did not fix the issue, we should be able to...
[16:34:12] !log dancy@deploy1003 Installation of scap version "4.166.0" completed for 2 hosts
[16:35:41] (PS1) Andrew Bogott: Dummy passwords for upcoming octavia install [labs/private] - https://gerrit.wikimedia.org/r/1145262 (https://phabricator.wikimedia.org/T393783)
[16:37:03] (PS2) Andrew Bogott: Dummy passwords for upcoming octavia install [labs/private] - https://gerrit.wikimedia.org/r/1145262 (https://phabricator.wikimedia.org/T393783)
[16:38:58] (CR) BCornwall: [C:+2] Revert "admin: move jiji to ops-limited" [puppet] - https://gerrit.wikimedia.org/r/1143489 (owner: Effie Mouzeli)
[16:39:04] (CR) Scott French: "Thanks for the review!"
[puppet] - https://gerrit.wikimedia.org/r/1139947 (https://phabricator.wikimedia.org/T373752) (owner: Scott French)
[16:39:17] (CR) Scott French: [C:+2] P:mediawiki::php: add uuid extension for PHP 8.1+ [puppet] - https://gerrit.wikimedia.org/r/1139947 (https://phabricator.wikimedia.org/T373752) (owner: Scott French)
[16:40:10] (Abandoned) Cwhite: prometheus: add more recording rules around editResponseTime [puppet] - https://gerrit.wikimedia.org/r/1144662 (https://phabricator.wikimedia.org/T391677) (owner: Cwhite)
[16:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[16:44:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[16:45:40] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[16:46:04] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[16:46:05] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[16:46:44] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[16:46:46] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[16:46:56] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:47:28] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[16:47:37] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[16:47:44] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[16:47:45] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[16:48:08] (CR) BCornwall: [C:+2] admin: Add bwojtowicz to ML-related accesses [puppet] - https://gerrit.wikimedia.org/r/1144649 (https://phabricator.wikimedia.org/T393595) (owner: BCornwall)
[16:48:11] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[16:48:12] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[16:48:51] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[16:49:05] (CR) Hnowlan: [C:+1] "lgtm from a changeprop perspective! The commit message says "bulk of" the load is handled - does that imply there's some left in changepro" [deployment-charts] - https://gerrit.wikimedia.org/r/1035732 (owner: DCausse)
[16:49:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[16:49:39] SRE, SRE-Access-Requests, LDAP-Access-Requests, Machine-Learning-Team, Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10817576 (BCornwall) 0...
[16:50:24] !log maintenance complete on msw2-eqiad
[16:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:41] (CR) Hnowlan: [C:+1] P:mw::maint::purge_expired_userrights: purge_expired_global_rights to k8s [puppet] - https://gerrit.wikimedia.org/r/1143199 (https://phabricator.wikimedia.org/T385866) (owner: Scott French)
[16:52:08] FIRING: [3x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[16:55:10] RESOLVED: BFDdown: BFD session down between cr3-eqsin and fe80::6687:88ff:fef2:6d50 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[16:55:17] (PS1) Btullis: Remove the Categories_Lag icinga check [puppet] - https://gerrit.wikimedia.org/r/1145265 (https://phabricator.wikimedia.org/T374916)
[16:55:31] (CR) Federico Ceratto: [C:+1] "I'm seeing just code being moved in a different file (as described) so LGTM. I did *not* run the code or tests.
If you need hands-on testi" [cookbooks] - https://gerrit.wikimedia.org/r/1145236 (owner: MVernon)
[16:55:46] (PS2) CDanis: wikifunctions: send traces to the collector [deployment-charts] - https://gerrit.wikimedia.org/r/1145242 (https://phabricator.wikimedia.org/T390753)
[16:55:58] !log on going maintenance on msw2-codfw
[16:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:23] (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5538/console" [puppet] - https://gerrit.wikimedia.org/r/1145265 (https://phabricator.wikimedia.org/T374916) (owner: Btullis)
[16:57:07] (PS2) Btullis: Remove the Categories_Lag icinga check [puppet] - https://gerrit.wikimedia.org/r/1145265 (https://phabricator.wikimedia.org/T374916)
[16:57:42] (PS1) Alexandros Kosiaris: function-evaluator: Bump CPU requests/limits [deployment-charts] - https://gerrit.wikimedia.org/r/1145267 (https://phabricator.wikimedia.org/T389375)
[16:58:05] (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5539/console" [puppet] - https://gerrit.wikimedia.org/r/1145265 (https://phabricator.wikimedia.org/T374916) (owner: Btullis)
[16:58:54] jouncebot: nowandnext
[16:58:54] For the next 0 hour(s) and 1 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T1600)
[16:58:54] In 0 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T1700)
[16:59:43] hnowlan: I have a couple of mw-cron migrations stacked up for the infra window. happy to share if you want to coordinate changes :)
[17:00:05] swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late).
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T1700).
[17:00:15] o/
[17:00:32] (CR) DCausse: "yes, but should be very low, it's related to the "archive" index which is updated everytime a page gets deleted/restored:" [deployment-charts] - https://gerrit.wikimedia.org/r/1035732 (owner: DCausse)
[17:01:50] (CR) Scott French: "Thanks for the reviews!" [puppet] - https://gerrit.wikimedia.org/r/1144637 (https://phabricator.wikimedia.org/T388530) (owner: Scott French)
[17:01:56] (CR) Scott French: [C:+2] P:mw::maintenance::refreshlinks: migrate remaining shards to mw-cron [puppet] - https://gerrit.wikimedia.org/r/1144637 (https://phabricator.wikimedia.org/T388530) (owner: Scott French)
[17:01:56] (Abandoned) Bking: lvs: add script to check for L2 connectivity [puppet] - https://gerrit.wikimedia.org/r/1030185 (https://phabricator.wikimedia.org/T363702) (owner: Bking)
[17:03:04] swfrench-wmf: thanks! I think everything I want to do will be better done tomorrow morning so I'll leave it for now
[17:03:58] sounds good :)
[17:04:27] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1073 to cirrussearch1073
[17:04:51] !log bking@cumin2002 START - Cookbook sre.dns.netbox
[17:06:42] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:08:28] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1073 to cirrussearch1073 - bking@cumin2002"
[17:09:01] np
[17:09:23] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1073 to cirrussearch1073 - bking@cumin2002"
[17:09:24]
!log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:09:24] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1073 on all recursors
[17:09:27] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1073 on all recursors
[17:09:28] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1073
[17:09:40] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[17:09:50] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[17:10:44] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1073
[17:11:00] (CR) Scott French: [C:+2] P:mw::maint::purge_expired_userrights: purge_expired_userrights to k8s [puppet] - https://gerrit.wikimedia.org/r/1143198 (https://phabricator.wikimedia.org/T385866) (owner: Scott French)
[17:11:03] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1084 to cirrussearch1084
[17:11:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1073 to cirrussearch1073
[17:11:27] !log bking@cumin2002 START - Cookbook sre.dns.netbox
[17:15:00] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1084 to cirrussearch1084 - bking@cumin2002"
[17:15:28] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1084 to cirrussearch1084 - bking@cumin2002"
[17:15:28] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:15:29] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1084 on all recursors
[17:15:32] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1084 on all recursors
[17:15:33] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1084
[17:16:42] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:16:46] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1073.eqiad.wmnet with OS bullseye
[17:16:50] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1073
[17:16:50] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1073
[17:17:04] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1084
[17:17:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1084 to cirrussearch1084
[17:17:49] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[17:17:59] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[17:20:06] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1084.eqiad.wmnet with OS bullseye
[17:20:10] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1084
[17:20:11] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1084
[17:20:11] (PS1) Jdlrobson: Expand dark mode access for anons (May 2025 deployments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1145273 (https://phabricator.wikimedia.org/T393386)
[17:20:32] !log maintenance complete on all 3 switches
[17:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:32] (PS1) Jdlrobson: Add ArticleSummaries to beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1145274
(https://phabricator.wikimedia.org/T392520)
[17:24:10] (CR) Scott French: [C:+2] P:mw::maint::purge_expired_userrights: purge_expired_global_rights to k8s [puppet] - https://gerrit.wikimedia.org/r/1143199 (https://phabricator.wikimedia.org/T385866) (owner: Scott French)
[17:24:33] (PS2) Jdlrobson: Nearby should show file namespace on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1141904 (https://phabricator.wikimedia.org/T52133)
[17:31:01] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[17:31:19] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1073.eqiad.wmnet with reason: host reimage
[17:31:36] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[17:32:16] (PS1) Ebernhardson: Revert^2 "search: cname specific search clusters to the lvs pool" [dns] - https://gerrit.wikimedia.org/r/1145276
[17:32:16] (PS1) Ebernhardson: Revert^2 "search: add discovery records for secondary clusters" [dns] - https://gerrit.wikimedia.org/r/1145277
[17:32:20] (PS5) Ebernhardson: search: Update dnsdisc envoy upstreams [puppet] - https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553)
[17:32:20] (PS1) Ebernhardson: etcd data for search-{psi,omega} dns discovery [puppet] - https://gerrit.wikimedia.org/r/1145278 (https://phabricator.wikimedia.org/T143553)
[17:34:51] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1084.eqiad.wmnet with reason: host reimage
[17:35:08] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1073.eqiad.wmnet with reason: host reimage
[17:35:22] (CR) Scott French: "Thanks for the review!"
[puppet] - https://gerrit.wikimedia.org/r/1144692 (https://phabricator.wikimedia.org/T388535) (owner: Scott French)
[17:35:23] (CR) Scott French: [C:+2] P:mw:maint:update_flaggedrev_stats: migrate to mw-cron [puppet] - https://gerrit.wikimedia.org/r/1144692 (https://phabricator.wikimedia.org/T388535) (owner: Scott French)
[17:38:49] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1084.eqiad.wmnet with reason: host reimage
[17:39:46] (CR) BCornwall: [C:+1] "Assuming sukhe's mention of the other record is fine, looks good to me." [dns] - https://gerrit.wikimedia.org/r/1145191 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[17:41:34] !log cmooney@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 264595
[17:41:38] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[17:41:49] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919
[17:41:52] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919
[17:42:07] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[17:43:44] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 264595
[17:52:12] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Move connections on ssw1-f1-codfw to match normal pattern - https://phabricator.wikimedia.org/T393936#10817962 (cmooney) Open→Resolved a:cmooney Super @Jhancock.wm that all looks good now and links are working :) ` c...
[17:54:05] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install pc1018 - https://phabricator.wikimedia.org/T392492#10817971 (Jclark-ctr)
[17:54:43] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install pc1018 - https://phabricator.wikimedia.org/T392492#10817972 (Jclark-ctr)
[17:58:46] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1073.eqiad.wmnet with OS bullseye
[18:00:06] jnuche and jeena: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T1800).
[18:02:01] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1084.eqiad.wmnet with OS bullseye
[18:02:25] FIRING: SystemdUnitFailed: isc-dhcp-server.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:04:13] (PS3) CDanis: wikifunctions: send traces to the collector [deployment-charts] - https://gerrit.wikimedia.org/r/1145242 (https://phabricator.wikimedia.org/T390753)
[18:07:06] (PS4) CDanis: wikifunctions: send traces to the collector [deployment-charts] - https://gerrit.wikimedia.org/r/1145242 (https://phabricator.wikimedia.org/T390753)
[18:11:14] (PS3) Btullis: Remove the Categories_Lag icinga check [puppet] - https://gerrit.wikimedia.org/r/1145265 (https://phabricator.wikimedia.org/T374916)
[18:11:53] (CR) Btullis: Remove the Categories_Lag icinga check [puppet] - https://gerrit.wikimedia.org/r/1145265 (https://phabricator.wikimedia.org/T374916) (owner: Btullis)
[18:12:42] (PS1) Fabfur: cache: add option to enable or disable varnishkafka instance [puppet] - https://gerrit.wikimedia.org/r/1145282
(https://phabricator.wikimedia.org/T393772)
[18:12:57] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: Kimberly Sarabia)
[18:13:46] (CR) CI reject: [V:-1] cache: add option to enable or disable varnishkafka instance [puppet] - https://gerrit.wikimedia.org/r/1145282 (https://phabricator.wikimedia.org/T393772) (owner: Fabfur)
[18:14:02] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5540/co" [puppet] - https://gerrit.wikimedia.org/r/1145265 (https://phabricator.wikimedia.org/T374916) (owner: Btullis)
[18:22:01] (PS2) Fabfur: cache: add option to enable or disable varnishkafka instance [puppet] - https://gerrit.wikimedia.org/r/1145282 (https://phabricator.wikimedia.org/T393772)
[18:27:25] RESOLVED: SystemdUnitFailed: isc-dhcp-server.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:29:25] (PS1) Andrea Denisse: pyrra: Add labels to the varnish-combined SLO alert [puppet] - https://gerrit.wikimedia.org/r/1145288 (https://phabricator.wikimedia.org/T394080)
[18:29:25] (CR) Andrea Denisse: [V:+1] "PCC results: https://puppet-compiler.wmflabs.org/output/1145288/5541/" [puppet] - https://gerrit.wikimedia.org/r/1145288 (https://phabricator.wikimedia.org/T394080) (owner: Andrea Denisse)
[18:35:33] (PS4) Scott French: P:mw::maint::backfill_localaccounts: backfillLocalAccounts-loginwiki to k8s [puppet] - https://gerrit.wikimedia.org/r/1143226 (https://phabricator.wikimedia.org/T385866)
[18:35:36] (PS4) Scott French:
P:mw::maint::backfill_localaccounts: backfillLocalAccounts-metawiki to k8s [puppet] - https://gerrit.wikimedia.org/r/1143227 (https://phabricator.wikimedia.org/T385866)
[18:44:53] (CR) Scott French: [C:+1] "Thanks, Hugh!" [puppet] - https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[18:45:19] (CR) Ssingh: [C:+1] "I compared PS5 with the current one and I am fine with you merging this fwiw. You can wait for Valentin's review as well." [cookbooks] - https://gerrit.wikimedia.org/r/1129882 (owner: BCornwall)
[18:52:34] (CR) BCornwall: "I imagine that all SLOs with multiple ratio sections should have `slo_component` added, shouldn't they?" [puppet] - https://gerrit.wikimedia.org/r/1145288 (https://phabricator.wikimedia.org/T394080) (owner: Andrea Denisse)
[18:54:03] (CR) Ecarg: [C:+1] "thank you!" [deployment-charts] - https://gerrit.wikimedia.org/r/1145242 (https://phabricator.wikimedia.org/T390753) (owner: CDanis)
[19:04:54] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[19:10:04] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudvirt1076 to codfw - jhancock@cumin2002"
[19:10:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudvirt1076 to codfw - jhancock@cumin2002"
[19:10:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:10:48] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1076
[19:10:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1076
[19:11:09] SRE, SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10818171 (BCornwall)
@cmassaro You shouldn't need to do this again if you transfer your key (though you're very welcome to do this again for security's sake!) Right now we need to conf...
[19:11:23] (PS1) Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783)
[19:11:38] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1145282 (https://phabricator.wikimedia.org/T393772) (owner: Fabfur)
[19:11:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudvirt1076.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:12:33] (CR) CI reject: [V:-1] Openstack: rough in Octavia service for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[19:15:03] jhancock@cumin2002 provision (PID 2579901) is awaiting input
[19:15:25] (PS2) Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783)
[19:17:25] !log Import Varnish 7.1.1-2~bpo11+wmf1 into bullseye-wikimedia (T394004)
[19:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:27] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[19:18:53] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_eqiad
[19:21:07] (CR) Andrew Bogott: [V:+2 C:+2] Dummy passwords for upcoming octavia install [labs/private] - https://gerrit.wikimedia.org/r/1145262 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[19:21:34] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host
cloudvirt1076.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:21:50] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host pc1018.eqiad.wmnet with OS bookworm
[19:21:56] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install pc1018 - https://phabricator.wikimedia.org/T392492#10818205 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host pc1018.eqiad.wmnet with OS bookworm
[19:21:58] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1258.eqiad.wmnet with OS bookworm
[19:22:03] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10818206 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host db1258.eqiad.wmnet with OS bookworm
[19:22:08] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[19:23:22] (PS3) Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783)
[19:23:28] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[19:23:56] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_eqiad
[19:24:13] (CR) Fabfur: "Done" [puppet] - https://gerrit.wikimedia.org/r/1143755 (https://phabricator.wikimedia.org/T392073) (owner: Fabfur)
[19:25:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 18.27% idle - https://bit.ly/wmf-fpmsat -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:27:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[19:28:25] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1143625 (https://phabricator.wikimedia.org/T386247) (owner: Bernard Wang)
[19:30:13] (PS4) Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783)
[19:30:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 19.2% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:30:28] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[19:31:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudvirt1076.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:32:11] (PS1) BCornwall: Revert "hiera: lvs3009: set lower priority (depool)" [puppet] -
https://gerrit.wikimedia.org/r/1145317
[19:32:15] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[19:32:41] (CR) Ssingh: [C:+1] Revert "hiera: lvs3009: set lower priority (depool)" [puppet] - https://gerrit.wikimedia.org/r/1145317 (owner: BCornwall)
[19:33:22] (CR) BCornwall: [C:+2] Revert "hiera: lvs3009: set lower priority (depool)" [puppet] - https://gerrit.wikimedia.org/r/1145317 (owner: BCornwall)
[19:34:36] (PS5) Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783)
[19:35:40] (CR) CI reject: [V:-1] Openstack: rough in Octavia service for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[19:37:18] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1018.eqiad.wmnet with reason: host reimage
[19:37:20] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1258.eqiad.wmnet with reason: host reimage
[19:37:22] !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs3009*} and A:liberica (T393616)
[19:37:25] T393616: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616
[19:37:42] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs3009*} and A:liberica (T393616)
[19:40:29] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919
[19:40:30] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_magru
[19:40:32] T381919: Supermicro: unable to set boot order after using Redfish to boot once -
https://phabricator.wikimedia.org/T381919 [19:40:40] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_magru [19:40:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1018.eqiad.wmnet with reason: host reimage [19:41:06] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [19:44:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1076.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:45:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1258.eqiad.wmnet with reason: host reimage [19:45:38] (03CR) 10Andrea Denisse: [V:03+1] "Thanks for the review and great point! I agree that other SLOs using multiple ratio metrics could run into the same issue. 
That said, I th" [puppet] - 10https://gerrit.wikimedia.org/r/1145288 (https://phabricator.wikimedia.org/T394080) (owner: 10Andrea Denisse) [19:45:50] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1076'] [19:46:06] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudvirt1076'] [19:47:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1076.eqiad.wmnet with OS bookworm [19:47:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10818294 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudvirt1076.eqiad.wmnet with OS bookworm [19:48:48] (03PS6) 10Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) [19:49:49] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [19:52:26] (03PS7) 10Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) [19:55:51] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host pc1018.eqiad.wmnet with OS bookworm [19:55:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc1018 - https://phabricator.wikimedia.org/T392492#10818313 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host pc1018.eqiad.wmnet with OS bookworm executed with errors: - pc1018 (**FAIL**)... 
[19:57:25] (03CR) 10BCornwall: [C:03+1] "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1145288 (https://phabricator.wikimedia.org/T394080) (owner: 10Andrea Denisse) [19:57:46] (03PS8) 10Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) [19:57:46] (03PS1) 10Andrew Bogott: Openstack config templates: move [keystone_authtoken] out of common template [puppet] - 10https://gerrit.wikimedia.org/r/1145321 [19:58:00] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145321 (owner: 10Andrew Bogott) [19:59:39] (03PS2) 10Andrew Bogott: Openstack config templates: move [keystone_authtoken] out of common template [puppet] - 10https://gerrit.wikimedia.org/r/1145321 [19:59:39] (03PS9) 10Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) [19:59:59] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145321 (owner: 10Andrew Bogott) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and thcipriani: That opportune time for a UTC late backport window / Backport Party! Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T2000). [20:00:05] bvibber, James_F, and kimberly_sarabia: A patch you scheduled for UTC late backport window / Backport Party! Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:09] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [20:00:10] Heya. [20:00:15] I can deploy. [20:00:15] heyo [20:00:20] Or thcipriani can. [20:00:28] hello [20:00:33] Deployment party!!! [20:00:41] Should I use SpiderPig to show off? [20:00:46] well, we were doing a spiderpig show off time :) happy to deploy as part of that [20:00:46] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:00:50] You should. [20:00:57] Ack. [20:01:01] bvibber: You here yet? [20:01:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:01:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1258.eqiad.wmnet with OS bookworm [20:01:08] o/ [20:01:13] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10818317 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host db1258.eqiad.wmnet with OS bookworm completed: - db1258 (**PASS**) - Removed... [20:01:18] Awesome, you're up first. [20:01:36] i have a mediawiki bit which is in the queue :D and a service update which i can do when we're done with anything else on the list [20:01:39] James_F: thx! 
[20:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:01:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/Chart] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144638 (https://phabricator.wikimedia.org/T393377) (owner: 10Jdlrobson) [20:02:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1076.eqiad.wmnet with reason: host reimage [20:02:28] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host pc1018.eqiad.wmnet with OS bookworm [20:02:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc1018 - https://phabricator.wikimedia.org/T392492#10818318 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host pc1018.eqiad.wmnet with OS bookworm [20:02:56] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:04:16] (03PS1) 10Bvibber: Update chart-renderer in production to current [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145323 [20:04:55] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1018.eqiad.wmnet with reason: host reimage [20:05:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1076.eqiad.wmnet with reason: host reimage [20:06:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10818324 (10Jhancock.wm) [20:08:50] !log 
jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1018.eqiad.wmnet with reason: host reimage [20:10:07] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:10:45] (03Merged) 10jenkins-bot: Update to echarts 5.6.0 [extensions/Chart] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144638 (https://phabricator.wikimedia.org/T393377) (owner: 10Jdlrobson) [20:11:09] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1144638|Update to echarts 5.6.0 (T393377)]] [20:11:12] T393377: Upgrade Charts from eCharts 5.5.1 to 5.6.0 - https://phabricator.wikimedia.org/T393377 [20:14:14] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudvirt1075 to codfw - jhancock@cumin2002" [20:14:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudvirt1075 to codfw - jhancock@cumin2002" [20:14:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:14:36] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1075 [20:14:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1075 [20:15:48] !log jforrester@deploy1003 jforrester, jdlrobson: Backport for [[gerrit:1144638|Update to echarts 5.6.0 (T393377)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:15:49] (03Abandoned) 10Andrea Denisse: pyrra: Add labels to the varnish-combined SLO alert [puppet] - 10https://gerrit.wikimedia.org/r/1145288 (https://phabricator.wikimedia.org/T394080) (owner: 10Andrea Denisse) [20:16:00] !log jforrester@deploy1003 jforrester, jdlrobson: Continuing with sync [20:16:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host 
cloudvirt1075.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:17:27] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10818349 (10Jclark-ctr) [20:17:37] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10818360 (10Jclark-ctr) 05Open→03Resolved [20:18:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1018.eqiad.wmnet with OS bookworm [20:18:14] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc1018 - https://phabricator.wikimedia.org/T392492#10818362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host pc1018.eqiad.wmnet with OS bookworm completed: - pc1018 (**PASS**) - Downtim... [20:18:54] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Requesting access to for  - https://phabricator.wikimedia.org/T393066#10818363 (10SCampos-WMF) Thank you, @BTullis just confirming that everything is work... 
[20:20:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc1018 - https://phabricator.wikimedia.org/T392492#10818377 (10Jclark-ctr) 05Open→03Resolved a:05Marostegui→03Jclark-ctr [20:21:24] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host thanos-fe1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:21:46] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:22:45] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1144638|Update to echarts 5.6.0 (T393377)]] (duration: 11m 36s) [20:22:48] T393377: Upgrade Charts from eCharts 5.5.1 to 5.6.0 - https://phabricator.wikimedia.org/T393377 [20:23:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: 10Kimberly Sarabia) [20:23:51] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-fe1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:24:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:24:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1076.eqiad.wmnet with OS bookworm [20:24:35] (03Merged) 10jenkins-bot: Stream registration for article summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: 10Kimberly Sarabia) [20:24:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - 
https://phabricator.wikimedia.org/T382492#10818403 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudvirt1076.eqiad.wmnet with OS bookworm complet... [20:24:56] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1129958|Stream registration for article summaries (T389097 T387406)]] [20:25:01] T389097: [summaries] Create new data collection stream for summaries mobile pilot - https://phabricator.wikimedia.org/T389097 [20:25:01] T387406: [summaries] Create instrument + wire-up summary UI - https://phabricator.wikimedia.org/T387406 [20:25:10] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host thanos-fe1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:27:05] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-fe1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:28:13] (03CR) 10Bking: "Sorry for the late reply, I haven't looked at this in awhile. 
I thought there was a problem with categories metrics, but either that's bee" [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking) [20:28:31] (03CR) 10Bking: [C:03+2] Remove the Categories_Lag icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1145265 (https://phabricator.wikimedia.org/T374916) (owner: 10Btullis) [20:29:39] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-fe1007.eqiad.wmnet with OS bullseye [20:29:43] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10818427 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host thanos-fe1007.eqiad.wmnet with OS bullseye [20:31:22] !log jforrester@deploy1003 ksarabia, jforrester: Backport for [[gerrit:1129958|Stream registration for article summaries (T389097 T387406)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:31:26] T389097: [summaries] Create new data collection stream for summaries mobile pilot - https://phabricator.wikimedia.org/T389097 [20:31:27] T387406: [summaries] Create instrument + wire-up summary UI - https://phabricator.wikimedia.org/T387406 [20:31:27] !log jforrester@deploy1003 ksarabia, jforrester: Continuing with sync [20:33:00] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for thiemowmde - https://phabricator.wikimedia.org/T393798#10818440 (10BCornwall) 05Open→03In progress p:05Triage→03Medium [20:35:37] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for thiemowmde - https://phabricator.wikimedia.org/T393798#10818444 (10BCornwall) [20:36:35] (03PS1) 10BCornwall: admin: Add thiemowmde to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1145325 (https://phabricator.wikimedia.org/T393798) [20:38:09] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1129958|Stream 
registration for article summaries (T389097 T387406)]] (duration: 13m 12s) [20:38:13] T389097: [summaries] Create new data collection stream for summaries mobile pilot - https://phabricator.wikimedia.org/T389097 [20:38:13] T387406: [summaries] Create instrument + wire-up summary UI - https://phabricator.wikimedia.org/T387406 [20:38:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143625 (https://phabricator.wikimedia.org/T386247) (owner: 10Bernard Wang) [20:39:40] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10818464 (10BCornwall) [20:39:46] (03Merged) 10jenkins-bot: Remove web_ab_test_enrollment schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143625 (https://phabricator.wikimedia.org/T386247) (owner: 10Bernard Wang) [20:40:06] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1143625|Remove web_ab_test_enrollment schema (T386247)]] [20:40:10] T386247: Finish cleaning up WebABTestEnrollment - https://phabricator.wikimedia.org/T386247 [20:40:18] jhancock@cumin2002 provision (PID 2610600) is awaiting input [20:40:30] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10818468 (10BCornwall) [20:41:35] (03PS1) 10BCornwall: admin: Add esanders to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1145327 (https://phabricator.wikimedia.org/T393724) [20:41:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:42:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
cloudvirt1075.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:46:48] !log jforrester@deploy1003 bwang, jforrester: Backport for [[gerrit:1143625|Remove web_ab_test_enrollment schema (T386247)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:46:52] T386247: Finish cleaning up WebABTestEnrollment - https://phabricator.wikimedia.org/T386247 [20:46:56] !log jforrester@deploy1003 bwang, jforrester: Continuing with sync [20:46:56] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:47:20] (03PS1) 10Santiago Faci: Experimentation Lab: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145328 (https://phabricator.wikimedia.org/T390036) [20:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [20:53:43] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1143625|Remove web_ab_test_enrollment schema (T386247)]] (duration: 13m 36s) [20:53:47] T386247: Finish cleaning up WebABTestEnrollment - https://phabricator.wikimedia.org/T386247 [20:53:53] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1075'] [20:54:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145235 (https://phabricator.wikimedia.org/T345477) (owner: 10Jforrester) [20:54:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.28) - 
10https://gerrit.wikimedia.org/r/1145234 (https://phabricator.wikimedia.org/T345477) (owner: 10Jforrester) [20:54:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudvirt1075'] [20:54:59] (03CR) 10Jforrester: [C:03+1] Update chart-renderer in production to current [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145323 (owner: 10Bvibber) [20:55:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10818501 (10Jclark-ctr) [20:55:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10818502 (10Jclark-ctr) @Stevemunene replaced the drives on these 2 servers [20:55:44] (03CR) 10Bvibber: [C:03+2] Update chart-renderer in production to current [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145323 (owner: 10Bvibber) [20:55:49] (03CR) 10Clare Ming: [C:03+2] Experimentation Lab: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145328 (https://phabricator.wikimedia.org/T390036) (owner: 10Santiago Faci) [20:56:22] (03Merged) 10jenkins-bot: Register our magic vars, so the parser knows to ask us what their values are [extensions/WikiLambda] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145235 (https://phabricator.wikimedia.org/T345477) (owner: 10Jforrester) [20:56:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1075.eqiad.wmnet with OS bookworm [20:56:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10818504 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudvirt1075.eqiad.wmnet with OS 
bookworm [20:57:14] (03Merged) 10jenkins-bot: Update chart-renderer in production to current [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145323 (owner: 10Bvibber) [20:57:18] (03Merged) 10jenkins-bot: Experimentation Lab: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145328 (https://phabricator.wikimedia.org/T390036) (owner: 10Santiago Faci) [20:57:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10818517 (10Jhancock.wm) [20:57:31] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10818519 (10VRiley-WMF) [20:58:52] !log bvibber@deploy1003 helmfile [staging] START helmfile.d/services/chart-renderer: apply [20:59:49] !log bvibber@deploy1003 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [21:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250513T2100) [21:00:16] !log bvibber@deploy1003 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [21:00:28] (03Merged) 10jenkins-bot: Register our magic vars, so the parser knows to ask us what their values are [extensions/WikiLambda] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1145234 (https://phabricator.wikimedia.org/T345477) (owner: 10Jforrester) [21:00:53] !log bvibber@deploy1003 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [21:00:55] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1145235|Register our magic vars, so the parser knows to ask us what their values are (T345477)]], [[gerrit:1145234|Register our magic vars, so the parser knows to ask us what their values are (T345477)]] [21:01:00] T345477: Counter of number of functions for a WikiLambda installation - https://phabricator.wikimedia.org/T345477 [21:01:02] !log 
bvibber@deploy1003 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [21:01:32] !log bvibber@deploy1003 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [21:07:19] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10818540 (10Jclark-ctr) [21:07:35] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1145235|Register our magic vars, so the parser knows to ask us what their values are (T345477)]], [[gerrit:1145234|Register our magic vars, so the parser knows to ask us what their values are (T345477)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:07:39] !log jforrester@deploy1003 jforrester: Continuing with sync [21:07:40] T345477: Counter of number of functions for a WikiLambda installation - https://phabricator.wikimedia.org/T345477 [21:09:30] (03PS1) 10BCornwall: admin: Add skivlehan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1145333 (https://phabricator.wikimedia.org/T393626) [21:11:58] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10818547 (10BCornwall) Hi, @SKivlehan! Who's your manager? We'll need them to comment on here with their approval. Thanks! 
[21:12:17] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10818553 (10BCornwall) 05Open→03In progress p:05Triage→03Medium [21:12:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1075.eqiad.wmnet with reason: host reimage [21:14:09] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1145235|Register our magic vars, so the parser knows to ask us what their values are (T345477)]], [[gerrit:1145234|Register our magic vars, so the parser knows to ask us what their values are (T345477)]] (duration: 13m 13s) [21:14:12] T345477: Counter of number of functions for a WikiLambda installation - https://phabricator.wikimedia.org/T345477 [21:16:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1075.eqiad.wmnet with reason: host reimage [21:16:05] (03PS1) 10Bking: cirrussearch: add cirrussearch row A/remove elastic row B [puppet] - 10https://gerrit.wikimedia.org/r/1145334 (https://phabricator.wikimedia.org/T391118) [21:22:23] (03PS2) 10Bking: cirrussearch: add cirrussearch row A/remove elastic row B [puppet] - 10https://gerrit.wikimedia.org/r/1145334 (https://phabricator.wikimedia.org/T391118) [21:22:43] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145334 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking) [21:28:42] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10818586 (10BCornwall) a:03SKivlehan-WMF [21:29:45] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-fe1007.eqiad.wmnet with OS bullseye [21:29:52] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10818591 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host thanos-fe1007.eqiad.wmnet with OS bullseye [21:30:29] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10818593 (10SKivlehan-WMF) Hey, @BCornwall! My manager is @spatton -- Sam, can you take a look at this when you have a chance? Thank you both! [21:30:53] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10818596 (10BCornwall) a:05SKivlehan-WMF→03spatton [21:31:34] (03PS3) 10Ryan Kemper: cirrussearch: add cirrussearch row A/remove elastic row B [puppet] - 10https://gerrit.wikimedia.org/r/1145334 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking) [21:31:52] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145334 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking) [21:33:45] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:35:34] (03PS4) 10Ryan Kemper: cirrussearch: add cirrussearch row A/remove elastic row B [puppet] - 10https://gerrit.wikimedia.org/r/1145334 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking) [21:35:52] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145334 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking) [21:36:37] (03PS5) 10Ryan Kemper: cirrussearch: add cirrussearch row A/remove elastic row B [puppet] - 10https://gerrit.wikimedia.org/r/1145334 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking) [21:36:51] jhancock@cumin2002 reimage (PID 2629230) is awaiting input [21:39:22] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [21:40:24] (03CR) 10Bking: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1145334 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking) [21:42:30] (03CR) 10Bking: [C:03+2] cirrussearch: add cirrussearch row A/remove elastic row B [puppet] - 10https://gerrit.wikimedia.org/r/1145334 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking) [21:42:33] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [21:42:47] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for apus-be1004 - jclark@cumin1002" [21:42:55] jclark@cumin1002 provision (PID 1881180) is awaiting input [21:43:00] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [21:43:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for apus-be1004 - jclark@cumin1002" [21:43:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:43:33] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host apus-be1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:44:17] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [21:45:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10818632 (10VRiley-WMF) [21:45:39] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1007.eqiad.wmnet with reason: host reimage [21:47:19] (03PS6) 10Ebernhardson: search: Update dnsdisc envoy upstreams [puppet] - 10https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553) [21:47:51] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudvirt1068 - 
vriley@cumin1002" [21:47:57] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudvirt1068 - vriley@cumin1002" [21:47:57] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:48:12] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1068 [21:48:18] (03PS2) 10Ladsgroup: Add x1 to DBRecordCache for dumps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145243 (https://phabricator.wikimedia.org/T393513) [21:48:21] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1068 [21:48:29] (03CR) 10CI reject: [V:04-1] search: Update dnsdisc envoy upstreams [puppet] - 10https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [21:49:09] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1068.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:49:20] (03CR) 10CI reject: [V:04-1] Add x1 to DBRecordCache for dumps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145243 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup) [21:49:49] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1007.eqiad.wmnet with reason: host reimage [21:53:02] (03PS7) 10Ebernhardson: search: Update dnsdisc envoy upstreams [puppet] - 10https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553) [21:53:38] 10ops-codfw, 06DC-Ops: Reboot of in rack mgmt switches in codfw - https://phabricator.wikimedia.org/T394108 (10Papaul) 03NEW [21:53:43] 10ops-codfw, 06DC-Ops: Reboot of in rack mgmt switches in codfw - https://phabricator.wikimedia.org/T394108#10818673 (10Papaul) p:05Triage→03Medium [21:55:23] 10ops-codfw, 06DC-Ops: Reboot of in rack mgmt 
switches in codfw - https://phabricator.wikimedia.org/T394108#10818678 (10Papaul) [22:00:46] 10ops-eqiad, 06DC-Ops: Reboot of in rack mgmt switches in eqiad - https://phabricator.wikimedia.org/T394109 (10Papaul) 03NEW [22:05:38] (03PS1) 10Cwhite: logstash: create partition for ml logs [puppet] - 10https://gerrit.wikimedia.org/r/1145339 (https://phabricator.wikimedia.org/T390215) [22:07:56] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [22:09:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-be1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:09:54] vriley@cumin1002 provision (PID 1882059) is awaiting input [22:10:19] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host apus-be1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:11:02] vriley@cumin1002 reimage (PID 1877826) is awaiting input [22:13:54] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [22:13:55] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe1007.eqiad.wmnet with OS bullseye [22:14:01] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10818709 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host thanos-fe1007.eqiad.wmnet with OS bullseye c... 
[22:22:36] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - Status - issue on elastic1062:9290 - https://phabricator.wikimedia.org/T393657#10818725 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr no errors on prometheus at this time both psu have lights [22:27:26] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:27:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1075.eqiad.wmnet with OS bookworm [22:27:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10818731 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudvirt1075.eqiad.wmnet with OS bookworm complet... [22:28:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-be1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:37:21] (03PS11) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) [22:39:29] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [22:40:09] (03CR) 10CI reject: [V:04-1] [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [22:40:49] 10ops-codfw, 06SRE, 06DC-Ops: Reboot of in rack mgmt switches in codfw - https://phabricator.wikimedia.org/T394108#10818761 (10Jhancock.wm) rebooted everything listed except the F4 server (no traffic). 
a mgmt IP pings in every rack. leaving open so i don't forget to reset the one in F4 when i get the key in... [22:43:26] (03CR) 10Cwhite: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1144554 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [22:43:35] FIRING: NetworkDeviceAlarmActive: Alarm active on cr1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [22:45:28] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_magru [22:45:57] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_magru [22:48:35] FIRING: [2x] NetworkDeviceAlarmActive: Alarm active on cr1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [22:49:14] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host apus-be1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:52:01] 10ops-codfw, 06SRE, 06DC-Ops: Reboot of in rack mgmt switches in codfw - https://phabricator.wikimedia.org/T394108#10818798 (10Papaul) @Jhancock.wm thank you [22:53:47] (03PS12) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) [22:56:33] (03CR) 10CI reject: [V:04-1] [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [22:58:13] (03CR) 10DLynch: [C:03+1] admin: Add esanders to deployment group [puppet] - 
10https://gerrit.wikimedia.org/r/1145327 (https://phabricator.wikimedia.org/T393724) (owner: 10BCornwall) [22:59:40] (03PS13) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) [23:00:08] (03PS1) 10Jforrester: UserInfo: Conditionally register the REST API route [extensions/CheckUser] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145345 (https://phabricator.wikimedia.org/T394070) [23:00:16] (03PS1) 10Jforrester: UserInfo: Conditionally register the REST API route [extensions/CheckUser] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1145346 (https://phabricator.wikimedia.org/T394070) [23:10:26] 10ops-eqiad, 06SRE, 06DC-Ops: Reboot of in rack mgmt switches in eqiad - https://phabricator.wikimedia.org/T394109#10818904 (10VRiley-WMF) All these switches have been rebooted. [23:11:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [23:12:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-be1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:16:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [23:29:20] vriley@cumin1002 provision (PID 1882059) is awaiting input [23:30:25] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1068.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell 
SCP reboot policy FORCED [23:31:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10818920 (10Jclark-ctr) @MatthewVernon @Papaul is this might be the first server with a boss card they come with a predefined RAID 1 usually are we leaving it or trying to ch... [23:38:35] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1145352 [23:38:35] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1145352 (owner: 10TrainBranchBot) [23:46:07] (03PS14) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) [23:51:30] (03PS15) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) [23:53:25] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1145352 (owner: 10TrainBranchBot)