[00:01:58] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:15:42] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:46:57] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:52:09] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[01:35:42] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:14:40] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[02:39:24] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:46:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[03:23:26] FIRING: [2x] ProbeDown: Service wdqs1023:443 has failed probes (http_wdqs_scholarly_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1023:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:39:24] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:54:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:01:58] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:47:02] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:52:09] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:59:50] !log Deploy schema change in x1 eqiad (with replication) dbmaint T394509
[04:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:53] T394509: Drop unused event_type (varbinary) field and replace with event_types(integer) field in production - https://phabricator.wikimedia.org/T394509
[05:10:42] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:12:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1210 T394508', diff saved to https://phabricator.wikimedia.org/P76280 and previous config saved to /var/cache/conftool/dbconfig/20250519-051224-marostegui.json
[05:12:28] T394508: Create a new candidate master for s5 - https://phabricator.wikimedia.org/T394508
[05:12:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[05:18:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76281 and previous config saved to /var/cache/conftool/dbconfig/20250519-051832-root.json
[05:33:26] FIRING: [4x] ProbeDown: Service wdqs1023:443 has failed probes (http_wdqs_scholarly_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:33:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76282 and previous config saved to /var/cache/conftool/dbconfig/20250519-053338-root.json
[05:42:33] !log uploaded openjdk-8 8u452-ga-1~deb12u1 to component/jdk8 for bookworm-wikimedia
[05:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:48:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76283 and previous config saved to /var/cache/conftool/dbconfig/20250519-054844-root.json
[06:00:42] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:03:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76284 and previous config saved to /var/cache/conftool/dbconfig/20250519-060349-root.json
[06:09:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1183 T394507', diff saved to https://phabricator.wikimedia.org/P76285 and previous config saved to /var/cache/conftool/dbconfig/20250519-060946-marostegui.json
[06:09:50] T394507: decommission db1183 - https://phabricator.wikimedia.org/T394507
[06:14:40] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[06:17:04] !log installing openjdk-8 security updates
[06:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76286 and previous config saved to /var/cache/conftool/dbconfig/20250519-061855-root.json
[06:25:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:29:37] !log installing Java 21 security updates
[06:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76287 and previous config saved to /var/cache/conftool/dbconfig/20250519-063400-root.json
[06:35:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:46:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[06:47:19] !log jmm@dns1004 START - running authdns-update
[06:47:58] !log jmm@dns1004 END - running authdns-update
[07:00:05] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T0700).
[07:00:05] isaranto: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:17] o/
[07:00:28] I can try deploying with Spiderpig
[07:02:18] Amir1: urbanecm shall I give it a go?
[07:11:06] !log updated bootimage for Bookworm to 12.11 T394489
[07:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:10] T394489: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489
[08:01:58] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:15:51] I am going to start a second mysql process in db1204. That will send an intentional alert. please ignore
[08:18:38] !log testing paging status with db1204
[08:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:35] should be here anytime
[08:19:44] ^ jayme
[08:20:10] uhg, it is very slow
[08:20:57] ah no, it is not hard down yet
[08:21:19] 2/3
[08:21:38] acked
[08:21:45] but not shown on IRC yet
[08:22:21] did you get it, jayme ?
[08:23:11] !incidents
[08:23:12] 6143 (ACKED) db1204/mysqld processes (paged)
[08:23:12] 6141 (RESOLVED) RESTGatewayBackendErrorsHigh sre (mobileapps_cluster rest-gateway eqiad)
[08:23:52] so it didn't show on IRC, that's bad
[08:28:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1176.eqiad.wmnet with reason: Maintenance
[08:28:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2230.codfw.wmnet with reason: Maintenance
[08:28:56] !log Install 10.6.22 on db1176 and db2230 testing hosts T394623
[08:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:59] T394623: MariaDB 10.6.22 released - https://phabricator.wikimedia.org/T394623
[08:33:57] (PS1) Clément Goubert: alertmanager: Use alert summary as title for most tasks [puppet] - https://gerrit.wikimedia.org/r/1147702 (https://phabricator.wikimedia.org/T385709)
[08:34:03] jouncebot: nowandnext
[08:34:03] No deployments scheduled for the next 1 hour(s) and 25 minute(s)
[08:34:03] In 1 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T1000)
[08:34:09] (CR) Elukey: [C:+1] sshd: Remove dead template argument [puppet] - https://gerrit.wikimedia.org/r/1146968 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff)
[08:34:46] (CR) TrainBranchBot: [C:+2] "Approved by taavi@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1146491 (owner: Majavah)
[08:35:03] (CR) Brouberol: [C:+2] deployment_server: install airflow-devenv on the host [puppet] - https://gerrit.wikimedia.org/r/1147692 (https://phabricator.wikimedia.org/T394038) (owner: Brouberol)
[08:35:10] (CR) Brouberol: [C:+2] Do not monitor the airflow-dev namespace [alerts] - https://gerrit.wikimedia.org/r/1147695 (https://phabricator.wikimedia.org/T394491) (owner: Brouberol)
[08:35:36] (Merged) jenkins-bot: Do not show thumbnails or descriptions on Wikitech search [mediawiki-config] - https://gerrit.wikimedia.org/r/1146491 (owner: Majavah)
[08:36:04] !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1146491|Do not show thumbnails or descriptions on Wikitech search]]
[08:38:16] sre-alert-triage, Data-Platform-SRE: Alert in need of triage: MegaRAID (instance an-worker1135) - https://phabricator.wikimedia.org/T394632 (LSobanski) NEW
[08:40:34] (CR) Brouberol: [C:+1] Add a copy of the dump scripts that are in puppet [dumps] - https://gerrit.wikimedia.org/r/1147028 (https://phabricator.wikimedia.org/T394389) (owner: Btullis)
[08:42:50] (PS24) Vgutierrez: varnish: Issue and handle WMF-Uniq cookie [puppet] - https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411)
[08:45:09] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) (owner: Jsn.sherman)
[08:47:02] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:54] (PS1) Clément Goubert: mediawiki: Add startingDeadlineSeconds to CronJobs [deployment-charts] - https://gerrit.wikimedia.org/r/1147709 (https://phabricator.wikimedia.org/T394423)
[08:51:18] !log taavi@deploy1003 taavi: Backport for [[gerrit:1146491|Do not show thumbnails or descriptions on Wikitech search]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:51:54] !log kcvelaga@deploy1003 Started deploy [airflow-dags/analytics_product@472cc1c]: T393559
[08:51:57] T393559: Data pipeline to load cx_translations to Data Lake, at wmf_product - https://phabricator.wikimedia.org/T393559
[08:52:09] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[08:52:25] !log taavi@deploy1003 taavi: Continuing with sync
[08:52:56] !log kcvelaga@deploy1003 Finished deploy [airflow-dags/analytics_product@472cc1c]: T393559 (duration: 01m 14s)
[08:52:56] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudnet2005-dev.codfw.wmnet
[08:53:53] (CR) Vgutierrez: "text tests:" [puppet] - https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[08:53:59] SRE, Page Content Service: Failure to decode mobileapps parameters should return a 404, not a 503 - https://phabricator.wikimedia.org/T394582#10833746 (Jgiannelos) This error doesn't come from mobileapps specific codebase but rather from the web framework directly. Should we put extra handling just for t...
[08:54:40] !log joal@deploy1003 Started deploy [airflow-dags/analytics@4ebb376]: Add new artifact to Airflow cache
[08:54:45] SRE, Page Content Service: Failure to decode mobileapps parameters should return a 404, not a 503 - https://phabricator.wikimedia.org/T394582#10833747 (Jgiannelos) Can we maybe handle it in rest-gateway envoy level?
[08:54:47] !log joal@deploy1003 Finished deploy [airflow-dags/analytics@4ebb376]: Add new artifact to Airflow cache (duration: 00m 07s)
[08:55:10] !log joal@deploy1003 Started deploy [airflow-dags/analytics@536dc9e]: Add new artifact to Airflow cache (after git pull ...)
[08:55:48] !log joal@deploy1003 Finished deploy [airflow-dags/analytics@536dc9e]: Add new artifact to Airflow cache (after git pull ...) (duration: 00m 38s)
[08:58:54] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2005-dev.codfw.wmnet
[08:59:07] (PS1) Clément Goubert: mw::periodic_job: add startingdeadlineseconds [puppet] - https://gerrit.wikimedia.org/r/1147710 (https://phabricator.wikimedia.org/T394423)
[09:01:54] !log taavi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146491|Do not show thumbnails or descriptions on Wikitech search]] (duration: 25m 50s)
[09:04:16] (PS1) Muehlenhoff: maps/osm: Move Kartotherian SQL config into Puppet [puppet] - https://gerrit.wikimedia.org/r/1147712 (https://phabricator.wikimedia.org/T381565)
[09:05:21] (CR) CI reject: [V:-1] maps/osm: Move Kartotherian SQL config into Puppet [puppet] - https://gerrit.wikimedia.org/r/1147712 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:07:10] (PS1) Elukey: Remove Lift Wing related dashboards from Grizzly [grafana-grizzly] - https://gerrit.wikimedia.org/r/1147713 (https://phabricator.wikimedia.org/T387350)
[09:07:49] (PS2) Muehlenhoff: maps/osm: Move Kartotherian SQL config into Puppet [puppet] - https://gerrit.wikimedia.org/r/1147712 (https://phabricator.wikimedia.org/T381565)
[09:09:55] SRE, Data-Platform-SRE (2025.05.02 - 2025.05.23): Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640 (Gehel) NEW
[09:10:12] SRE, Data-Platform-SRE (2025.05.02 - 2025.05.23): Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640#10833819 (Gehel) p:Triage→High
[09:11:45] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1147712 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:13:09] SRE, Data-Platform-SRE (2025.05.02 - 2025.05.23): Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640#10833821 (Gehel)
[09:13:09] (CR) Elukey: "looks good! I think there is a typo in a sql filename, except from that you are good to go." [puppet] - https://gerrit.wikimedia.org/r/1147712 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:14:20] SRE, Page Content Service: Failure to decode mobileapps parameters should return a 404, not a 503 - https://phabricator.wikimedia.org/T394582#10833823 (hnowlan) It appears that this isn't only broken encoding, it's just wildly invalid data being queried with additional parameters etc. The gateway itself...
[09:16:16] (PS1) Hnowlan: sre:rest-gateway: rename api gateway alert, disable paging [alerts] - https://gerrit.wikimedia.org/r/1147714 (https://phabricator.wikimedia.org/T394582)
[09:16:28] (PS3) Muehlenhoff: maps/osm: Move Kartotherian SQL config into Puppet [puppet] - https://gerrit.wikimedia.org/r/1147712 (https://phabricator.wikimedia.org/T381565)
[09:16:45] (CR) Muehlenhoff: maps/osm: Move Kartotherian SQL config into Puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1147712 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:16:57] !log downtime of elastic alerts T394640
[09:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:00] T394640: Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640
[09:17:17] !log installing net-tools security updates
[09:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:39] (PS2) Hnowlan: sre:rest-gateway: rename api gateway alert, disable paging [alerts] - https://gerrit.wikimedia.org/r/1147714 (https://phabricator.wikimedia.org/T394582)
[09:18:51] (CR) CI reject: [V:-1] sre:rest-gateway: rename api gateway alert, disable paging [alerts] - https://gerrit.wikimedia.org/r/1147714 (https://phabricator.wikimedia.org/T394582) (owner: Hnowlan)
[09:19:06] SRE, Data-Platform-SRE (2025.05.02 - 2025.05.23): Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640#10833852 (jcrespo) I've downtimed 30 or so alerts on @Gehel's suggestion for 48 hours (only delayed the noise, it will reap...
[09:20:07] (PS3) Hnowlan: sre:rest-gateway: rename api gateway alert, disable paging [alerts] - https://gerrit.wikimedia.org/r/1147714 (https://phabricator.wikimedia.org/T394582)
[09:20:10] ^ jayme acked 60 alerts, making the dashboard more clean
[09:20:21] thanks
[09:23:58] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1147712 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:26:30] (CR) Elukey: [C:+1] maps/osm: Move Kartotherian SQL config into Puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1147712 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:33:26] FIRING: [4x] ProbeDown: Service wdqs1023:443 has failed probes (http_wdqs_scholarly_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:33:28] sre-alert-triage, Data-Platform-SRE: Alert in need of triage: MegaRAID (instance an-worker1135) - https://phabricator.wikimedia.org/T394632#10833916 (Stevemunene) a:Stevemunene
[09:33:41] sre-alert-triage, Data-Platform-SRE (2025.05.02 - 2025.05.23): Alert in need of triage: MegaRAID (instance an-worker1135) - https://phabricator.wikimedia.org/T394632#10833917 (Stevemunene)
[09:36:21] (CR) Muehlenhoff: "Yiannis/Mateus: FYI so that future changes within the Kartotherian repo also get synched over to the server side for future imports." [puppet] - https://gerrit.wikimedia.org/r/1147712 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:37:16] (CR) Muehlenhoff: [C:+2] maps/osm: Move Kartotherian SQL config into Puppet [puppet] - https://gerrit.wikimedia.org/r/1147712 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:41:38] (CR) Btullis: [C:+2] Add a copy of the dump scripts that are in puppet [dumps] - https://gerrit.wikimedia.org/r/1147028 (https://phabricator.wikimedia.org/T394389) (owner: Btullis)
[09:44:09] SRE, Page Content Service, Content-Transform-Team (Work In Progress), Patch-For-Review: Failure to decode mobileapps parameters should return a 404, not a 503 - https://phabricator.wikimedia.org/T394582#10833982 (Jgiannelos)
[09:44:25] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:44:37] SRE, Page Content Service, Content-Transform-Team (Work In Progress), Patch-For-Review: Failure to decode mobileapps parameters should return a 404, not a 503 - https://phabricator.wikimedia.org/T394582#10833984 (Jgiannelos) Open→Resolved a:Jgiannelos
[09:44:48] SRE, Page Content Service, Content-Transform-Team (Work In Progress), Patch-For-Review: Failure to decode mobileapps parameters should return a 404, not a 503 - https://phabricator.wikimedia.org/T394582#10833990 (Jgiannelos) Resolved→Open
[09:47:43] (PS1) Brouberol: airflow-platform-eng: increase DAG processing timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1147723 (https://phabricator.wikimedia.org/T394459)
[09:47:44] (PS1) Brouberol: airflow-platform-eng: pull the main airflow-dags folder [deployment-charts] - https://gerrit.wikimedia.org/r/1147724 (https://phabricator.wikimedia.org/T394459)
[09:47:50] (CR) Volans: [C:+2] Add support for trixie [software/debmonitor-client] - https://gerrit.wikimedia.org/r/1145833 (https://phabricator.wikimedia.org/T391083) (owner: Volans)
[09:48:38] (CR) Btullis: [C:+1] airflow-platform-eng: pull the main airflow-dags folder [deployment-charts] - https://gerrit.wikimedia.org/r/1147724 (https://phabricator.wikimedia.org/T394459) (owner: Brouberol)
[09:49:00] (CR) Joal: airflow-platform-eng: pull the main airflow-dags folder (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1147724 (https://phabricator.wikimedia.org/T394459) (owner: Brouberol)
[09:49:09] (CR) Btullis: [C:+1] airflow-platform-eng: increase DAG processing timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1147723 (https://phabricator.wikimedia.org/T394459) (owner: Brouberol)
[09:49:29] (PS2) Brouberol: airflow-platform-eng: pull the main airflow-dags folder instead of analytics [deployment-charts] - https://gerrit.wikimedia.org/r/1147724 (https://phabricator.wikimedia.org/T394459)
[09:49:45] (CR) Brouberol: airflow-platform-eng: pull the main airflow-dags folder instead of analytics (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1147724 (https://phabricator.wikimedia.org/T394459) (owner: Brouberol)
[09:49:47] (CR) Btullis: [C:+1] airflow-platform-eng: pull the main airflow-dags folder instead of analytics [deployment-charts] - https://gerrit.wikimedia.org/r/1147724 (https://phabricator.wikimedia.org/T394459) (owner: Brouberol)
[09:49:54] (CR) Joal: [C:+1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1147724 (https://phabricator.wikimedia.org/T394459) (owner: Brouberol)
[09:50:01] (Merged) jenkins-bot: Add support for trixie [software/debmonitor-client] - https://gerrit.wikimedia.org/r/1145833 (https://phabricator.wikimedia.org/T391083) (owner: Volans)
[09:50:25] (CR) Brouberol: [C:+2] airflow-platform-eng: increase DAG processing timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1147723 (https://phabricator.wikimedia.org/T394459) (owner: Brouberol)
[09:50:33] (CR) Joal: [C:+1] airflow-platform-eng: increase DAG processing timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1147723 (https://phabricator.wikimedia.org/T394459) (owner: Brouberol)
[09:50:35] (CR) Brouberol: [C:+2] airflow-platform-eng: pull the main airflow-dags folder instead of analytics [deployment-charts] - https://gerrit.wikimedia.org/r/1147724 (https://phabricator.wikimedia.org/T394459) (owner: Brouberol)
[09:50:56] (CR) Marostegui: [C:+2] mariadb: Move s5 to SBR [puppet] - https://gerrit.wikimedia.org/r/1147424 (https://phabricator.wikimedia.org/T383795) (owner: Marostegui)
[09:51:14] (CR) Marostegui: [C:+2] "This is a NOOP until we restart mariadb or change the option live." [puppet] - https://gerrit.wikimedia.org/r/1147424 (https://phabricator.wikimedia.org/T383795) (owner: Marostegui)
[09:52:13] (Merged) jenkins-bot: airflow-platform-eng: increase DAG processing timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1147723 (https://phabricator.wikimedia.org/T394459) (owner: Brouberol)
[09:52:19] (PS2) Ladsgroup: tables-catalog: Add existencelinks table [puppet] - https://gerrit.wikimedia.org/r/1146954 (https://phabricator.wikimedia.org/T14019)
[09:52:20] (Merged) jenkins-bot: airflow-platform-eng: pull the main airflow-dags folder instead of analytics [deployment-charts] - https://gerrit.wikimedia.org/r/1147724 (https://phabricator.wikimedia.org/T394459) (owner: Brouberol)
[09:52:58] (CR) Ladsgroup: [V:+2 C:+2] tables-catalog: Add existencelinks table [puppet] - https://gerrit.wikimedia.org/r/1146954 (https://phabricator.wikimedia.org/T14019) (owner: Ladsgroup)
[09:53:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[09:54:12] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[09:55:45] SRE, Observability-Metrics: Every Grafana dashboard generated by Pyrra contains two panels displaying misleading data - https://phabricator.wikimedia.org/T393797#10834041 (elukey) I found https://github.com/pyrra-dev/pyrra/pull/969 that implements sort-of what I proposed above, the PR is old-ish and it s...
[09:57:35] (PS1) Joal: dse-k8s-services: Fix airflow analytics_test [deployment-charts] - https://gerrit.wikimedia.org/r/1147727 (https://phabricator.wikimedia.org/T394015)
[09:59:06] (CR) Brouberol: dse-k8s-services: Fix airflow analytics_test (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1147727 (https://phabricator.wikimedia.org/T394015) (owner: Joal)
[10:00:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db1172 and db2164 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76291 and previous config saved to /var/cache/conftool/dbconfig/20250519-100000-ladsgroup.json
[10:00:04] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820
[10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T1000)
[10:00:28] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1147729
[10:00:42] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:00:52] (PS2) Joal: dse-k8s-services: Fix airflow analytics_test [deployment-charts] - https://gerrit.wikimedia.org/r/1147727 (https://phabricator.wikimedia.org/T394015)
[10:01:11] (CR) Joal: dse-k8s-services: Fix airflow analytics_test (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1147727 (https://phabricator.wikimedia.org/T394015) (owner: Joal)
[10:01:36] (CR) Brouberol: [C:+1] dse-k8s-services: Fix airflow analytics_test [deployment-charts] - https://gerrit.wikimedia.org/r/1147727 (https://phabricator.wikimedia.org/T394015) (owner: Joal)
[10:01:40] (CR) Brouberol: [C:+2] dse-k8s-services: Fix airflow analytics_test [deployment-charts] - https://gerrit.wikimedia.org/r/1147727 (https://phabricator.wikimedia.org/T394015) (owner: Joal)
[10:02:15] (PS1) Muehlenhoff: maps/osm: Make create_layers_functions independent of Kartotherian checkout [puppet] - https://gerrit.wikimedia.org/r/1147730 (https://phabricator.wikimedia.org/T381565)
[10:02:49] (CR) Effie Mouzeli: [C:+2] cache.mcrouter: upgrade to 1.3.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1141201 (https://phabricator.wikimedia.org/T393281) (owner: Effie Mouzeli)
[10:03:25] (Merged) jenkins-bot: dse-k8s-services: Fix airflow analytics_test [deployment-charts] - https://gerrit.wikimedia.org/r/1147727 (https://phabricator.wikimedia.org/T394015) (owner: Joal)
[10:04:42] (Merged) jenkins-bot: cache.mcrouter: upgrade to 1.3.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1141201 (https://phabricator.wikimedia.org/T393281) (owner: Effie Mouzeli)
[10:05:53] (CR) Jgiannelos: [C:+2] mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1147729 (owner: PipelineBot)
[10:07:38] (Merged) jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1147729 (owner: PipelineBot)
[10:09:54] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[10:10:15] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1147730 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[10:10:23] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[10:11:09] o/ is it ok if I backport this change with Spiderpig https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1143638 ? It was scheduled for the morning window but didn't want to do it without anyone being around just in case cc: Amir1
[10:11:29] jouncebot: nowandnext
[10:11:29] For the next 0 hour(s) and 48 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T1000)
[10:11:29] In 2 hour(s) and 48 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T1300)
[10:11:47] that is a mw infra deploy window, let me see if anyone is using it
[10:12:09] I ain't
[10:12:25] ack, thanks! I saw nothing scheduled but yeah better ask :)
[10:12:47] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[10:13:11] isaranto: we rarely schedule things in the mw-infra window, unless they're really disruptive, but we do use it regularly even if nothing is scheduled
[10:13:23] (unsure if I make sense)
[10:13:36] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[10:13:38] good to know -- it makes total sense
[10:14:40] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[10:14:54] !log Move eqiad s5 replicas (except sanitarium master and backup sources) to SBR dbmaint T383795
[10:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:59] T383795: Move sX to STATEMENT based replication - https://phabricator.wikimedia.org/T383795
[10:17:20] I am going to start a second mysql process in db1204; that will be an intentional alert- please ignore it
[10:17:28] ^ jayme
[10:17:41] ack
[10:18:02] !log testing paging status with db1204
[10:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:57] (PS2) Muehlenhoff: maps/osm: Make create_layers_functions independent of Kartotherian checkout [puppet] - https://gerrit.wikimedia.org/r/1147730 (https://phabricator.wikimedia.org/T381565)
[10:20:13] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1147730 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[10:20:30] PROBLEM - mysqld processes #page on db1204 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:20:39] tappof: it worked now
[10:20:48] I see thank you jynus
[10:21:03] no, thanks to you!
[10:21:15] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[10:21:29] RECOVERY - mysqld processes #page on db1204 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:23:59] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1128828 (https://phabricator.wikimedia.org/T388787) (owner: Cyndywikime)
[10:26:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db1178 and db2165 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76292 and previous config saved to /var/cache/conftool/dbconfig/20250519-102615-ladsgroup.json
[10:26:20] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820
[10:30:38] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[10:31:36] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[10:32:11] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[10:42:25] (CR) Elukey: [C:+1] "Special prize for tech-debt cleanup :D" [puppet] - https://gerrit.wikimedia.org/r/1147730 (https://phabricator.wikimedia.org/T381565)
(owner: 10Muehlenhoff) [10:46:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [10:50:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db1226 and db2163 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76293 and previous config saved to /var/cache/conftool/dbconfig/20250519-105013-ladsgroup.json [10:50:18] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [10:54:06] (03PS1) 10Clément Goubert: zarcillo: Add mesh listener for noc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147735 [10:56:09] (03PS1) 10Btullis: Add all WMF domains to the eventgate-analytics-external certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147737 (https://phabricator.wikimedia.org/T391411) [10:56:15] (03CR) 10Clément Goubert: [C:03+2] zarcillo: Add mesh listener for noc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147735 (owner: 10Clément Goubert) [10:57:33] (03CR) 10CI reject: [V:04-1] Add all WMF domains to the eventgate-analytics-external certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147737 (https://phabricator.wikimedia.org/T391411) (owner: 10Btullis) [10:58:00] (03Merged) 10jenkins-bot: zarcillo: Add mesh listener for noc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147735 (owner: 10Clément Goubert) [10:58:55] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[11:02:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:03:21] (03PS10) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [11:04:03] (03PS1) 10Klausman: aptrepo: Add two missing deps to thirdparty/rocm63 repo [puppet] - 10https://gerrit.wikimedia.org/r/1147739 (https://phabricator.wikimedia.org/T385173) [11:04:24] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:04:28] (03PS11) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [11:04:55] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5591/console" [puppet] - 10https://gerrit.wikimedia.org/r/1147739 (https://phabricator.wikimedia.org/T385173) (owner: 10Klausman) [11:07:35] (03CR) 10Gkyziridis: [C:03+1] "LGTM! Thnx for working on this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1147104 (https://phabricator.wikimedia.org/T363581) (owner: 10Ilias Sarantopoulos) [11:07:47] (03CR) 10Muehlenhoff: [C:03+2] maps/osm: Make create_layers_functions independent of Kartotherian checkout [puppet] - 10https://gerrit.wikimedia.org/r/1147730 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:07:56] (03PS1) 10Hnowlan: (api|rest)-gateway: log 5xx errors by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147741 (https://phabricator.wikimedia.org/T394584) [11:08:09] (03PS2) 10Vgutierrez: varnish: Track original host header value for cookie scope purposes [puppet] - 10https://gerrit.wikimedia.org/r/1147728 (https://phabricator.wikimedia.org/T367346) [11:08:10] (03CR) 10Vgutierrez: "text tests:" [puppet] - 10https://gerrit.wikimedia.org/r/1147728 (https://phabricator.wikimedia.org/T367346) (owner: 10Vgutierrez) [11:09:22] 10SRE-tools, 10database-backups, 06Infrastructure-Foundations: transfer.py fails when handling nftables-configured firewall - https://phabricator.wikimedia.org/T393692#10834300 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:09:43] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10834302 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:11:42] (03CR) 10CI reject: [V:04-1] swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 (owner: 10MVernon) [11:12:36] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147742 [11:13:23] (03CR) 10Clément Goubert: [C:03+1] mw::periodic_job: add concurrency parameter to k8s jobs [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T394423) (owner: 10Hnowlan) [11:15:14] (03PS2) 10Btullis: Add all WMF domains to the eventgate-analytics-external certificate [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1147737 (https://phabricator.wikimedia.org/T391411) [11:18:57] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:19:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:19:47] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53940 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:20:23] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:20:37] (03CR) 10Hnowlan: [C:03+1] mediawiki: Add startingDeadlineSeconds to CronJobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147709 (https://phabricator.wikimedia.org/T394423) (owner: 10Clément Goubert) [11:21:24] jouncebot: nowandnext [11:21:24] No deployments scheduled for the next 1 hour(s) and 38 minute(s) [11:21:24] In 1 hour(s) and 38 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T1300) [11:21:28] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Add startingDeadlineSeconds to CronJobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147709 (https://phabricator.wikimedia.org/T394423) (owner: 10Clément Goubert) [11:21:58] (03CR) 10Hnowlan: [C:03+1] alertmanager: Use alert summary as title for most tasks [puppet] - 10https://gerrit.wikimedia.org/r/1147702 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [11:23:26] FIRING: [4x] ProbeDown: Service wdqs1023:443 has failed probes (http_wdqs_scholarly_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[11:23:46] (03PS1) 10Marostegui: db1169: Migrate to MariaDb 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1147744 (https://phabricator.wikimedia.org/T394653) [11:23:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1169 T394653', diff saved to https://phabricator.wikimedia.org/P76294 and previous config saved to /var/cache/conftool/dbconfig/20250519-112356-marostegui.json [11:24:00] T394653: Test MariaDB 10.11.13 - https://phabricator.wikimedia.org/T394653 [11:24:06] (03Merged) 10jenkins-bot: mediawiki: Add startingDeadlineSeconds to CronJobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147709 (https://phabricator.wikimedia.org/T394423) (owner: 10Clément Goubert) [11:24:29] (03CR) 10Kevin Bazira: [C:03+1] aptrepo: Add two missing deps to thirdparty/rocm63 repo [puppet] - 10https://gerrit.wikimedia.org/r/1147739 (https://phabricator.wikimedia.org/T385173) (owner: 10Klausman) [11:25:10] (03CR) 10Marostegui: [C:03+2] db1169: Migrate to MariaDb 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1147744 (https://phabricator.wikimedia.org/T394653) (owner: 10Marostegui) [11:25:15] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1169.eqiad.wmnet with reason: Maintenance [11:26:09] !log cgoubert@deploy1003 Started scap sync-world: 1147709: mediawiki: Add startingDeadlineSeconds to CronJobs - T394423 [11:26:13] T394423: Investigate startingDeadlineSeconds setting for kubernetes CronJobs - https://phabricator.wikimedia.org/T394423 [11:26:42] !log cgoubert@deploy1003 cgoubert: 1147709: mediawiki: Add startingDeadlineSeconds to CronJobs - T394423 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:28:12] !log cgoubert@deploy1003 Finished scap sync-world: 1147709: mediawiki: Add startingDeadlineSeconds to CronJobs - T394423 (duration: 02m 16s) [11:29:16] (03CR) 10Hnowlan: [C:03+1] mw::periodic_job: add startingdeadlineseconds [puppet] - 10https://gerrit.wikimedia.org/r/1147710 (https://phabricator.wikimedia.org/T394423) (owner: 10Clément Goubert) [11:29:59] (03CR) 10Clément Goubert: [C:03+2] mw::periodic_job: add concurrency parameter to k8s jobs [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T394423) (owner: 10Hnowlan) [11:30:11] (03CR) 10Clément Goubert: [C:03+2] mw::periodic_job: add startingdeadlineseconds [puppet] - 10https://gerrit.wikimedia.org/r/1147710 (https://phabricator.wikimedia.org/T394423) (owner: 10Clément Goubert) [11:30:24] (03PS2) 10Clément Goubert: mw::periodic_job: add startingdeadlineseconds [puppet] - 10https://gerrit.wikimedia.org/r/1147710 (https://phabricator.wikimedia.org/T394423) [11:30:33] (03CR) 10Clément Goubert: [C:03+2] mw::periodic_job: add startingdeadlineseconds [puppet] - 10https://gerrit.wikimedia.org/r/1147710 (https://phabricator.wikimedia.org/T394423) (owner: 10Clément Goubert) [11:30:55] (03PS2) 10Máté Szabó: Update IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) [11:30:59] (03CR) 10Máté Szabó: Update IPInfo access levels (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [11:32:49] 06SRE, 06Project-Admins: Disable #acl*sre_team workboard and update its project description - https://phabricator.wikimedia.org/T394654 (10Aklapper) 03NEW p:05Triage→03Low [11:36:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76295 and previous config saved to /var/cache/conftool/dbconfig/20250519-113639-root.json [11:37:45] Amir1: shall 
I give it a go now? [11:38:00] in a meeting rn, give me a bit [11:38:57] a ok! sorry for the ping(s) [11:40:35] (03CR) 10Federico Ceratto: [C:03+1] "LGTM, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147735 (owner: 10Clément Goubert) [11:48:48] I have 12 minutes, let's go [11:49:03] isaranto: it should be noop, so you wanna do it? [11:49:11] yes! [11:49:53] starting! [11:51:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by isaranto@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) (owner: 10Jsn.sherman) [11:51:30] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147742 (owner: 10PipelineBot) [11:51:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76296 and previous config saved to /var/cache/conftool/dbconfig/20250519-115146-root.json [11:52:33] (03Merged) 10jenkins-bot: Create dblist for ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) (owner: 10Jsn.sherman) [11:52:48] !log isaranto@deploy1003 Started scap sync-world: Backport for [[gerrit:1143638|Create dblist for ores extension (T391103)]] [11:52:51] T391103: DBA Review of Tables that ORES Extension will create - https://phabricator.wikimedia.org/T391103 [11:52:56] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147742 (owner: 10PipelineBot) [11:54:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db1167 and db2152 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76297 and previous config saved to /var/cache/conftool/dbconfig/20250519-115411-ladsgroup.json [11:54:15] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 
[11:55:13] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:55:33] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:55:41] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [11:56:13] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [11:56:21] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [11:56:51] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [11:56:54] !log isaranto@deploy1003 isaranto, jsn: Backport for [[gerrit:1143638|Create dblist for ores extension (T391103)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:58:19] !log isaranto@deploy1003 isaranto, jsn: Continuing with sync [11:58:38] (03CR) 10Kamila Součková: [C:03+1] "LGTM with an inline question" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147741 (https://phabricator.wikimedia.org/T394584) (owner: 10Hnowlan) [11:58:44] (03PS1) 10Clément Goubert: mw::maintenance: Set concurrency for listTaskCounts [puppet] - 10https://gerrit.wikimedia.org/r/1147745 (https://phabricator.wikimedia.org/T394018) [11:59:02] (03PS2) 10Effie Mouzeli: WIP: create allow-hostpath-mediawiki policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 [12:01:31] (03CR) 10Hnowlan: (api|rest)-gateway: log 5xx errors by default (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147741 (https://phabricator.wikimedia.org/T394584) (owner: 10Hnowlan) [12:01:47] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1147745 (https://phabricator.wikimedia.org/T394018) (owner: 10Clément Goubert) [12:01:54] (03PS1) 10Brouberol: Convert snapshot1017 into a dse-k8s-worker host [puppet] - 10https://gerrit.wikimedia.org/r/1147746 
(https://phabricator.wikimedia.org/T394647) [12:01:58] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:00] (03CR) 10Hnowlan: [C:03+1] mw::maintenance: Set concurrency for listTaskCounts [puppet] - 10https://gerrit.wikimedia.org/r/1147745 (https://phabricator.wikimedia.org/T394018) (owner: 10Clément Goubert) [12:02:16] (03PS2) 10Brouberol: Convert snapshot1017 into a dse-k8s-worker host [puppet] - 10https://gerrit.wikimedia.org/r/1147746 (https://phabricator.wikimedia.org/T394647) [12:02:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:04:24] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:05:05] (03CR) 10CI reject: [V:04-1] Convert snapshot1017 into a dse-k8s-worker host [puppet] - 10https://gerrit.wikimedia.org/r/1147746 (https://phabricator.wikimedia.org/T394647) (owner: 10Brouberol) [12:05:15] !log isaranto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1143638|Create dblist for ores extension (T391103)]] (duration: 12m 27s) [12:05:19] T391103: DBA Review of Tables that ORES Extension will create - https://phabricator.wikimedia.org/T391103 [12:06:09] (03PS2) 10Clément Goubert: mw::maintenance: Set concurrency for listTaskCounts [puppet] - 10https://gerrit.wikimedia.org/r/1147745 (https://phabricator.wikimedia.org/T394018) [12:06:23] (03CR) 10Clément Goubert: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1147745 (https://phabricator.wikimedia.org/T394018) (owner: 10Clément Goubert) [12:06:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76298 and previous config saved to /var/cache/conftool/dbconfig/20250519-120651-root.json [12:06:56] Done! [12:08:58] fyi: we're also gonna run the following maintenance script to create the tables for the ores-extension on idwiki [12:09:00] `mwscript-k8s --comment="T382171" -- extensions/WikimediaMaintenance/createExtensionTables.php --wiki=idwiki ores` [12:09:01] T382171: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171 [12:10:28] (03CR) 10Hnowlan: [C:03+1] mw::maintenance: Set concurrency for listTaskCounts [puppet] - 10https://gerrit.wikimedia.org/r/1147745 (https://phabricator.wikimedia.org/T394018) (owner: 10Clément Goubert) [12:10:48] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance: Set concurrency for listTaskCounts [puppet] - 10https://gerrit.wikimedia.org/r/1147745 (https://phabricator.wikimedia.org/T394018) (owner: 10Clément Goubert) [12:11:42] (03PS12) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [12:12:32] (03PS13) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [12:12:53] (03CR) 10Klausman: [V:03+1 C:03+2] aptrepo: Add two missing deps to thirdparty/rocm63 repo [puppet] - 10https://gerrit.wikimedia.org/r/1147739 (https://phabricator.wikimedia.org/T385173) (owner: 10Klausman) [12:13:34] (03PS14) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [12:14:35] (03PS15) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [12:15:07] (03PS16) 10MVernon: swift: add a 
check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [12:15:40] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.cd [12:16:04] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.cd [12:17:37] done [12:18:14] (03PS3) 10Brouberol: Convert snapshot1017 into a dse-k8s-worker host [puppet] - 10https://gerrit.wikimedia.org/r/1147746 (https://phabricator.wikimedia.org/T394647) [12:18:17] (03PS1) 10Clément Goubert: mw:maintenance: Fix newlines in kubernetes periodic job [puppet] - 10https://gerrit.wikimedia.org/r/1147754 [12:18:35] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [12:18:48] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [12:18:55] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147757 [12:21:42] (03CR) 10CI reject: [V:04-1] swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 (owner: 10MVernon) [12:21:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76299 and previous config saved to /var/cache/conftool/dbconfig/20250519-122157-root.json [12:29:16] (03CR) 10Btullis: Convert snapshot1017 into a dse-k8s-worker host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1147746 (https://phabricator.wikimedia.org/T394647) (owner: 10Brouberol) [12:30:21] (03CR) 10Brouberol: Convert snapshot1017 into a dse-k8s-worker host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1147746 (https://phabricator.wikimedia.org/T394647) (owner: 10Brouberol) [12:30:37] (03PS17) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 
10https://gerrit.wikimedia.org/r/1146007 [12:30:42] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.cd [12:30:59] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.cd [12:33:42] (03CR) 10Andrew Bogott: [C:03+2] openstack::clientpackages::vms: remove pre-dalmatian apt sources [puppet] - 10https://gerrit.wikimedia.org/r/1147165 (https://phabricator.wikimedia.org/T394438) (owner: 10Andrew Bogott) [12:36:24] (03PS4) 10Brouberol: Convert snapshot1017 into a dse-k8s-worker host [puppet] - 10https://gerrit.wikimedia.org/r/1147746 (https://phabricator.wikimedia.org/T394647) [12:37:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76300 and previous config saved to /var/cache/conftool/dbconfig/20250519-123702-root.json [12:37:54] (03CR) 10CI reject: [V:04-1] swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 (owner: 10MVernon) [12:38:41] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [12:38:57] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [12:39:02] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1147728 (https://phabricator.wikimedia.org/T367346) (owner: 10Vgutierrez) [12:39:31] (03CR) 10Btullis: [C:03+1] Convert snapshot1017 into a dse-k8s-worker host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1147746 (https://phabricator.wikimedia.org/T394647) (owner: 10Brouberol) [12:39:46] (03CR) 10Brouberol: [C:03+2] Convert snapshot1017 into a dse-k8s-worker host [puppet] - 10https://gerrit.wikimedia.org/r/1147746 (https://phabricator.wikimedia.org/T394647) (owner: 10Brouberol) [12:41:14] (03CR) 10Muehlenhoff: Convert snapshot1017 into a dse-k8s-worker host 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1147746 (https://phabricator.wikimedia.org/T394647) (owner: 10Brouberol) [12:42:00] (03PS1) 10Marostegui: mariadb: Move db1183 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/1147759 (https://phabricator.wikimedia.org/T394661) [12:42:03] (03CR) 10Brouberol: [C:03+2] Convert snapshot1017 into a dse-k8s-worker host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1147746 (https://phabricator.wikimedia.org/T394647) (owner: 10Brouberol) [12:42:27] (03CR) 10Muehlenhoff: [C:03+2] sshd: Remove ineffective configuration "Protocol" directive [puppet] - 10https://gerrit.wikimedia.org/r/1146989 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [12:42:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [12:42:57] (03PS4) 10Ladsgroup: mariadb: Add ores extension tables to the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1147104 (https://phabricator.wikimedia.org/T363581) (owner: 10Ilias Sarantopoulos) [12:43:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1179 T394661', diff saved to https://phabricator.wikimedia.org/P76301 and previous config saved to /var/cache/conftool/dbconfig/20250519-124302-marostegui.json [12:43:04] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Add ores extension tables to the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1147104 (https://phabricator.wikimedia.org/T363581) (owner: 10Ilias Sarantopoulos) [12:43:06] T394661: Move db1183 to x1 - https://phabricator.wikimedia.org/T394661 [12:43:13] (03PS1) 10Aqu: airflow-analytics-test: Increase scheduler limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147760 (https://phabricator.wikimedia.org/T369845) [12:43:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [12:43:26] 
(03CR) 10Marostegui: [C:03+2] mariadb: Move db1183 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/1147759 (https://phabricator.wikimedia.org/T394661) (owner: 10Marostegui) [12:43:46] Amir1: ok to merge? [12:44:04] marostegui: my table catalog patch ? yes [12:44:11] merging [12:44:50] (03PS2) 10Aqu: airflow-analytics-test: Increase scheduler limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147760 (https://phabricator.wikimedia.org/T369845) [12:45:06] (03PS1) 10Kamila Součková: mw-cron/GlobalBlocking: fix fixGlobalBlockWhitelist dblists [puppet] - 10https://gerrit.wikimedia.org/r/1147761 (https://phabricator.wikimedia.org/T388542) [12:45:21] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1147761 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [12:46:58] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:47:36] (03PS3) 10Aqu: airflow-analytics-test: Raise scheduler limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147760 (https://phabricator.wikimedia.org/T369845) [12:47:51] (03CR) 10Dreamy Jazz: mw-cron/GlobalBlocking: fix fixGlobalBlockWhitelist dblists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1147761 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [12:47:52] !log brouberol@cumin2002 START - Cookbook sre.hosts.rename from snapshot1017 to dse-k8s-worker1010 [12:48:03] (03PS5) 10Andrew Bogott: cloud-vps: stop installing openstack package osbpos on VMs [puppet] - 10https://gerrit.wikimedia.org/r/1147166 (https://phabricator.wikimedia.org/T394438) [12:48:04] !log brouberol@cumin2002 START - Cookbook sre.dns.netbox [12:49:01] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of 
db1179.eqiad.wmnet onto db1183.eqiad.wmnet [12:49:57] (03PS1) 10DCausse: Make weighted tags no longer be WMF-specific [extensions/CirrusSearch] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1147762 (https://phabricator.wikimedia.org/T393872) [12:50:39] (03PS2) 10Kamila Součková: mw-cron/GlobalBlocking: fix fixGlobalBlockWhitelist dblists [puppet] - 10https://gerrit.wikimedia.org/r/1147761 (https://phabricator.wikimedia.org/T388542) [12:50:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CirrusSearch] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1147762 (https://phabricator.wikimedia.org/T393872) (owner: 10DCausse) [12:51:02] (03CR) 10Kamila Součková: mw-cron/GlobalBlocking: fix fixGlobalBlockWhitelist dblists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1147761 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [12:51:17] !log brouberol@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming snapshot1017 to dse-k8s-worker1010 - brouberol@cumin2002" [12:51:38] !log brouberol@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming snapshot1017 to dse-k8s-worker1010 - brouberol@cumin2002" [12:51:38] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:51:39] !log brouberol@cumin2002 START - Cookbook sre.dns.wipe-cache dse-k8s-worker1010 on all recursors [12:51:42] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-worker1010 on all recursors [12:51:43] !log brouberol@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1010 [12:52:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: 
Repooling', diff saved to https://phabricator.wikimedia.org/P76302 and previous config saved to /var/cache/conftool/dbconfig/20250519-125208-root.json [12:52:09] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:52:15] (03CR) 10Btullis: [C:03+1] "LGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147760 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [12:52:52] !log brouberol@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1010 [12:53:32] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from snapshot1017 to dse-k8s-worker1010 [12:54:53] (03CR) 10Dreamy Jazz: [C:03+1] mw-cron/GlobalBlocking: fix fixGlobalBlockWhitelist dblists [puppet] - 10https://gerrit.wikimedia.org/r/1147761 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [12:55:08] (03CR) 10Dreamy Jazz: [C:03+1] "Looks good from TSP team point of view." [puppet] - 10https://gerrit.wikimedia.org/r/1147761 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [12:55:45] !log brouberol@cumin2002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1010.eqiad.wmnet with OS bookworm [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T1300). [13:00:05] Cyndywikime and dcausse: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:26] o/ [13:01:16] o/ [13:02:38] (03CR) 10Vgutierrez: [C:03+1] "list of FQDNs looks good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147737 (https://phabricator.wikimedia.org/T391411) (owner: 10Btullis) [13:02:59] (03PS1) 10DDesouza: Design Research survey: Increase coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147763 (https://phabricator.wikimedia.org/T394315) [13:03:45] I can deploy [13:04:21] Cyndywikime: going to ship you config change [13:04:50] okay :) [13:06:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147763 (https://phabricator.wikimedia.org/T394315) (owner: 10DDesouza) [13:07:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76303 and previous config saved to /var/cache/conftool/dbconfig/20250519-130713-root.json [13:07:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128828 (https://phabricator.wikimedia.org/T388787) (owner: 10Cyndywikime) [13:08:23] (03Merged) 10jenkins-bot: Growth: Remove unused PHP config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128828 (https://phabricator.wikimedia.org/T388787) (owner: 10Cyndywikime) [13:08:37] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1128828|Growth: Remove unused PHP config settings (T388787)]] [13:08:41] T388787: Remove now unused PHP config settings - https://phabricator.wikimedia.org/T388787 [13:08:58] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] notify_maintainers: ignore toolsbeta-tofu [puppet] - 10https://gerrit.wikimedia.org/r/1146952 (https://phabricator.wikimedia.org/T394453) (owner: 10Arturo Borrero Gonzalez) [13:10:45] (03CR) 10DCausse: [C:03+2] Make 
weighted tags no longer be WMF-specific [extensions/CirrusSearch] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1147762 (https://phabricator.wikimedia.org/T393872) (owner: 10DCausse) [13:11:58] (03CR) 10Klausman: [C:03+1] Remove Lift Wing related dashboards from Grizzly [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1147713 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [13:12:37] !log dcausse@deploy1003 dcausse, cyndywikime: Backport for [[gerrit:1128828|Growth: Remove unused PHP config settings (T388787)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:12:42] (03Merged) 10jenkins-bot: Make weighted tags no longer be WMF-specific [extensions/CirrusSearch] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1147762 (https://phabricator.wikimedia.org/T393872) (owner: 10DCausse) [13:12:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db1203 and db2162 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76304 and previous config saved to /var/cache/conftool/dbconfig/20250519-131254-ladsgroup.json [13:13:01] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [13:13:13] Cyndywikime: your patch is on mwdebug servers please test and let me know if it's working as expected :) [13:13:46] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [13:13:52] dcausse: testing [13:14:42] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [13:16:12] dcausse: all good!Thanks :) [13:16:20] Cyndywikime: ok, shipping [13:16:32] !log dcausse@deploy1003 dcausse, cyndywikime: Continuing with sync [13:16:38] !log brouberol@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1010.eqiad.wmnet with reason: host reimage [13:17:48] (03PS1) 
Tiziano Fogli: ircecho: exit upon disconnection [puppet] - https://gerrit.wikimedia.org/r/1147766 (https://phabricator.wikimedia.org/T389937)
[13:20:10] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1010.eqiad.wmnet with reason: host reimage
[13:22:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76305 and previous config saved to /var/cache/conftool/dbconfig/20250519-132218-root.json
[13:22:47] (CR) Kamila Součková: [C:+2] mw::maintenance: migrate growthexperiments-refreshPraiseworthyMentees [puppet] - https://gerrit.wikimedia.org/r/1146569 (https://phabricator.wikimedia.org/T385782) (owner: Kamila Součková)
[13:23:03] (CR) Kamila Součková: [C:+2] mw-cron/GlobalBlocking: fix fixGlobalBlockWhitelist dblists [puppet] - https://gerrit.wikimedia.org/r/1147761 (https://phabricator.wikimedia.org/T388542) (owner: Kamila Součková)
[13:23:23] (CR) Alexandros Kosiaris: [C:+1] ircecho: exit upon disconnection [puppet] - https://gerrit.wikimedia.org/r/1147766 (https://phabricator.wikimedia.org/T389937) (owner: Tiziano Fogli)
[13:23:30] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1128828|Growth: Remove unused PHP config settings (T388787)]] (duration: 14m 53s)
[13:23:34] T388787: Remove now unused PHP config settings - https://phabricator.wikimedia.org/T388787
[13:23:38] Cyndywikime: should be live
[13:24:14] err, spiderpig reports an error...
[13:24:52] (CR) Kamila Součková: [C:+1] am: mw-cron: Add Wikimedia-production-error tag [puppet] - https://gerrit.wikimedia.org/r/1147699 (https://phabricator.wikimedia.org/T385709) (owner: Clément Goubert)
[13:25:12] 🙃
[13:25:30] Could not resolve hostname snapshot1017.eqiad.wmnet: Name or service not known
[13:25:39] I guess we can ignore?
[13:25:52] that's the only host that failed to sync
[13:26:07] I guess so ?
[13:26:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db1211 from s8, move db2162 from s8 to x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76306 and previous config saved to /var/cache/conftool/dbconfig/20250519-132610-ladsgroup.json
[13:26:15] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820
[13:26:39] dcausse, Cyndywikime: cursory look at netbox suggests that host doesn't exist
[13:26:39] I guess it's because of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1147746
[13:26:47] Raine: thanks!
[13:26:57] yeah, that'd explain it :D
[13:27:10] ok, shipping next patch
[13:27:10] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 65 hosts with reason: eqiad is depooled, noisy alerts
[13:27:14] SRE, Data-Platform-SRE (2025.05.02 - 2025.05.23): Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640#10834736 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=53680efe-51bd-4e69-a0cb-c69f7ffac3dd) set by bki...
[13:27:35] that is the usual breakage/race condition between puppet/decom and scap relying on dsh group generated from the Puppet db
[13:28:11] hashar@deploy1003:~$ grep -R snapshot1017 /etc/dsh
[13:28:12] /etc/dsh/group/mediawiki-installation:snapshot1017.eqiad.wmnet
[13:28:33] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1147762|Make weighted tags no longer be WMF-specific (T393872)]]
[13:28:37] T393872: Make weighted tags no longer be WMF-specific - https://phabricator.wikimedia.org/T393872
[13:28:43] hashar: thanks, makes sense
[13:28:54] (CR) Elukey: [V:+2 C:+2] Remove Lift Wing related dashboards from Grizzly [grafana-grizzly] - https://gerrit.wikimedia.org/r/1147713 (https://phabricator.wikimedia.org/T387350) (owner: Elukey)
[13:29:05] and Raine confirmed it is gone so it is all good :]
[13:30:54] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[13:31:10] (CR) Tiziano Fogli: "Looking at the code, it seems that reconnections are supposed to be handled directly within SingleServerIRCBot and ExponentialBackoff, but" [puppet] - https://gerrit.wikimedia.org/r/1147766 (https://phabricator.wikimedia.org/T389937) (owner: Tiziano Fogli)
[13:31:33] SRE, Data-Platform-SRE (2025.05.02 - 2025.05.23): Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640#10834783 (bking) Apologies for the noise and thank you for bringing this to our attention. Since EQIAD is depooled, I've do...
[13:32:07] 06SRE, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640#10834786 (10bking) [13:32:51] !log dcausse@deploy1003 dcausse: Backport for [[gerrit:1147762|Make weighted tags no longer be WMF-specific (T393872)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:32:57] testing [13:33:39] (03CR) 10Clément Goubert: [C:03+2] am: mw-cron: Add Wikimedia-production-error tag [puppet] - 10https://gerrit.wikimedia.org/r/1147699 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [13:33:52] (03CR) 10Clément Goubert: [C:03+2] alertmanager: Use alert summary as title for most tasks [puppet] - 10https://gerrit.wikimedia.org/r/1147702 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [13:34:02] (03PS2) 10Clément Goubert: alertmanager: Use alert summary as title for most tasks [puppet] - 10https://gerrit.wikimedia.org/r/1147702 (https://phabricator.wikimedia.org/T385709) [13:35:09] !log dcausse@deploy1003 dcausse: Continuing with sync [13:35:17] (03CR) 10Kamila Součková: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [puppet] - 10https://gerrit.wikimedia.org/r/1147702 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [13:35:41] (03CR) 10Clément Goubert: [C:03+2] alertmanager: Use alert summary as title for most tasks [puppet] - 10https://gerrit.wikimedia.org/r/1147702 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [13:37:12] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1010.eqiad.wmnet with OS bookworm [13:40:01] (03PS4) 10Aqu: airflow-analytics-test: Raise scheduler limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147760 (https://phabricator.wikimedia.org/T369845) [13:40:56] (03PS1) 10Brouberol: Convert 
snapshot1014 into a dse-k8s-worker host [puppet] - 10https://gerrit.wikimedia.org/r/1147768 (https://phabricator.wikimedia.org/T394647) [13:42:02] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1147762|Make weighted tags no longer be WMF-specific (T393872)]] (duration: 13m 28s) [13:42:06] jouncebot: now [13:42:06] For the next 0 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T1300) [13:42:06] T393872: Make weighted tags no longer be WMF-specific - https://phabricator.wikimedia.org/T393872 [13:42:46] PROBLEM - LDAP -writable server- on seaborgium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [13:43:01] do you think you could fit in a quick no-op config change for me in this window? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1147075 [13:43:28] MatmaRex: sure [13:43:36] RECOVERY - LDAP -writable server- on seaborgium is OK: LDAP OK - 0.015 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [13:43:51] thanks :) [13:44:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147075 (owner: 10Bartosz Dziewoński) [13:44:25] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:30] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [13:44:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1147075 (owner: 10Bartosz Dziewoński) [13:45:03] (03Merged) 10jenkins-bot: Remove unused Echo 'notify-type-availability' config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147075 (owner: 10Bartosz Dziewoński) [13:45:22] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [13:45:23] MatmaRex: done :) [13:45:30] thanks a lot [13:45:57] !log closing the UTC afternoon backport window [13:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:44] 10ops-eqiad, 06DC-Ops: Power Supply - PS Redundancy - issue on backup1008 - https://phabricator.wikimedia.org/T394673 (10jcrespo) 03NEW [13:47:44] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [13:47:49] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [13:48:37] (03PS2) 10Muehlenhoff: sshd: Remove dead template argument [puppet] - 10https://gerrit.wikimedia.org/r/1146968 (https://phabricator.wikimedia.org/T393762) [13:48:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146968 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:51:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 19 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146013 (https://phabricator.wikimedia.org/T391270) (owner: 10Gergő Tisza) [13:52:44] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [13:52:47] (03CR) 10Hashar: "recheck the faulty CI build has vanished" [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1147060 (https://phabricator.wikimedia.org/T373017) (owner: 10Reedy) [13:53:55] (03CR) 10Btullis: [C:03+1] "Nice." [puppet] - 10https://gerrit.wikimedia.org/r/1147768 (https://phabricator.wikimedia.org/T394647) (owner: 10Brouberol) [13:54:10] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10834856 (10Andrew) >>! In T376400#10825452, @taavi wrote: > The site at http://ec2-54-81-201-239.compute-1.amazonaws.com/ seems to embed images fro... [13:54:28] (03CR) 10Brouberol: [C:03+2] Convert snapshot1014 into a dse-k8s-worker host [puppet] - 10https://gerrit.wikimedia.org/r/1147768 (https://phabricator.wikimedia.org/T394647) (owner: 10Brouberol) [13:56:14] 06SRE: Avoid using codfw expansion cage for non-IPIP LVS-fronted services - https://phabricator.wikimedia.org/T394286#10834871 (10Jhancock.wm) I think I'm gonna make a physical list and post it somewhere in the DH5. for my personal reference. I will otherwise forget this is a thing. Thanks! 
[13:56:50] !log brouberol@cumin2002 START - Cookbook sre.hosts.rename from snapshot1014 to dse-k8s-worker1011
[13:57:14] !log brouberol@cumin2002 START - Cookbook sre.dns.netbox
[13:57:32] SRE, SRE-Unowned, serviceops-radar, wikitech.wikimedia.org, Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10834875 (taavi) >>! In T376400#10834856, @Andrew wrote: > Can you point me to some specific examples? My half-baked spot checks (e.g. http://ec2-...
[13:57:44] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[13:57:53] ops-eqiad, DC-Ops: Power Supply - PS Redundancy - issue on backup1008:9290 - https://phabricator.wikimedia.org/T394674 (phaultfinder) NEW
[13:58:32] (PS1) Andrew Bogott: Put cloudvirt10[68-76] into service [puppet] - https://gerrit.wikimedia.org/r/1147772 (https://phabricator.wikimedia.org/T394671)
[14:00:30] SRE, SRE-Unowned, serviceops-radar, wikitech.wikimedia.org, Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10834892 (Andrew) yep, I see it now.
[14:00:36] (03CR) 10Andrew Bogott: [C:03+2] Put cloudvirt10[68-76] into service [puppet] - 10https://gerrit.wikimedia.org/r/1147772 (https://phabricator.wikimedia.org/T394671) (owner: 10Andrew Bogott) [14:00:45] (03PS7) 10Ilias Sarantopoulos: ores-extension: enable ores extention for rrla without the UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144526 (https://phabricator.wikimedia.org/T382171) [14:00:57] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:01:04] !log brouberol@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming snapshot1014 to dse-k8s-worker1011 - brouberol@cumin2002" [14:01:28] !log brouberol@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming snapshot1014 to dse-k8s-worker1011 - brouberol@cumin2002" [14:01:28] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:01:29] !log brouberol@cumin2002 START - Cookbook sre.dns.wipe-cache dse-k8s-worker1011 on all recursors [14:01:32] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-worker1011 on all recursors [14:01:33] !log brouberol@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1011 [14:02:38] !log brouberol@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1011 [14:02:50] (03CR) 10CI reject: [V:04-1] ores-extension: enable ores extention for rrla without the UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144526 (https://phabricator.wikimedia.org/T382171) (owner: 10Ilias Sarantopoulos) [14:03:18] !log 
brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from snapshot1014 to dse-k8s-worker1011 [14:03:40] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply [14:03:46] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [14:03:54] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:04:14] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:05:03] !log brouberol@cumin2002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1011.eqiad.wmnet with OS bookworm [14:07:37] (03PS8) 10Ilias Sarantopoulos: ores-extension: enable ores extention for rrla without the UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144526 (https://phabricator.wikimedia.org/T382171) [14:13:05] (03PS9) 10Ilias Sarantopoulos: ores-extension: enable ores extention for rrla without the UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144526 (https://phabricator.wikimedia.org/T382171) [14:14:55] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [14:16:50] !log bking@cumin2002 conftool action : set/pooled=no; selector: name=cirrussearch2091.codfw.wmnet [14:17:48] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/dev-btullis on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:22:48] RESOLVED: HelmReleaseBadStatus: Helm release airflow-dev/dev-btullis on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:28:26] (03PS1) 10Brouberol: airflow: do not package the tls-termination service for devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147777 (https://phabricator.wikimedia.org/T393999) [14:29:20] PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [14:30:14] RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 765, active_shards: 1802, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [14:30:14] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:30:30] (03PS1) 10Clément Goubert: mw::maintenance: Add ttlsecondsafterfinished to long interval jobs [puppet] - 10https://gerrit.wikimedia.org/r/1147778 (https://phabricator.wikimedia.org/T394423) [14:31:33] (03CR) 10CI reject: [V:04-1] mw::maintenance: Add ttlsecondsafterfinished to long interval jobs [puppet] - 10https://gerrit.wikimedia.org/r/1147778 (https://phabricator.wikimedia.org/T394423) (owner: 10Clément Goubert) [14:33:14] (03PS1) 10Bking: cirrussearch: add cirrussearch row B/remove elastic row C [puppet] - 10https://gerrit.wikimedia.org/r/1147779 (https://phabricator.wikimedia.org/T388610) [14:33:32] 07sre-alert-triage, 
06Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr2-drmrs) - https://phabricator.wikimedia.org/T393991#10835035 (10cmooney) p:05Triage→03Low [14:34:07] (03PS2) 10Clément Goubert: mw::maintenance: Add ttlsecondsafterfinished to long interval jobs [puppet] - 10https://gerrit.wikimedia.org/r/1147778 (https://phabricator.wikimedia.org/T394423) [14:34:58] 06SRE, 10Page Content Service, 06serviceops: mobileapps returns 500 error in response to malformed /v1/page/summary path - https://phabricator.wikimedia.org/T394610#10835045 (10hnowlan) This issue was hopefully addressed in https://gerrit.wikimedia.org/r/c/mediawiki/services/mobileapps/+/1147736 [14:35:04] (03PS1) 10Vgutierrez: varnish: Prevent Vary: X-E-E from reaching users [puppet] - 10https://gerrit.wikimedia.org/r/1147780 (https://phabricator.wikimedia.org/T391411) [14:36:07] 06SRE, 10Page Content Service, 10Content-Transform-Team (Work In Progress), 13Patch-For-Review: Failure to decode mobileapps parameters should return a 404, not a 503 - https://phabricator.wikimedia.org/T394582#10835053 (10hnowlan) I can no longer trigger a 500 when querying using one of the bad API paths... [14:36:59] (03CR) 10BBlack: [C:03+1] varnish: Prevent Vary: X-E-E from reaching users [puppet] - 10https://gerrit.wikimedia.org/r/1147780 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:37:34] (03CR) 10Mforns: [C:03+1] "LGTM! 
Thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1147728 (https://phabricator.wikimedia.org/T367346) (owner: 10Vgutierrez) [14:37:42] (03CR) 10Clément Goubert: [C:03+2] alertmanager: Add notifications-echo task creation route [puppet] - 10https://gerrit.wikimedia.org/r/1146874 (https://phabricator.wikimedia.org/T394471) (owner: 10Clément Goubert) [14:37:58] (03PS2) 10Bking: cirrussearch: add cirrussearch row B/remove elastic row C [puppet] - 10https://gerrit.wikimedia.org/r/1147779 (https://phabricator.wikimedia.org/T388610) [14:38:02] 06SRE, 10Observability-Metrics: Every Grafana dashboard generated by Pyrra contains two panels displaying misleading data - https://phabricator.wikimedia.org/T393797#10835060 (10herron) >>! In T393797#10820949, @elukey wrote: > Thanks a lot! So the changes will not be reverted by future Pyrra filesystem syncs... [14:38:32] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1147779 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:39:45] (03PS1) 10Effie Mouzeli: hieradata: add usernames for mw-expermental [puppet] - 10https://gerrit.wikimedia.org/r/1147782 (https://phabricator.wikimedia.org/T276994) [14:40:48] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1147780 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:41:04] (03PS1) 10Fabfur: hiera: disable vk on A:cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1147783 (https://phabricator.wikimedia.org/T393772) [14:41:27] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance: migrate echo_mail_batch to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1146875 (https://phabricator.wikimedia.org/T394471) (owner: 10Clément Goubert) [14:41:40] 06SRE, 10Page Content Service, 06serviceops: mobileapps returns 500 error in response to malformed /v1/page/summary path - https://phabricator.wikimedia.org/T394610#10835083 (10Jgiannelos) 05Open→03Resolved a:03Jgiannelos Just verified this: 
GET https://en.wikipedia.org/api/rest_v1/page/summary/ext... [14:41:59] 06SRE, 10Page Content Service, 10Content-Transform-Team (Work In Progress), 13Patch-For-Review: Failure to decode mobileapps parameters should return a 404, not a 503 - https://phabricator.wikimedia.org/T394582#10835086 (10Jgiannelos) 05Open→03Resolved [14:42:07] (03CR) 10CI reject: [V:04-1] hiera: disable vk on A:cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1147783 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [14:42:14] (03CR) 10Vgutierrez: "text tests:" [puppet] - 10https://gerrit.wikimedia.org/r/1147780 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:42:40] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1147780 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:43:13] (03PS2) 10Fabfur: hiera: disable vk on A:cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1147783 (https://phabricator.wikimedia.org/T393772) [14:44:31] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1147783 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [14:45:21] (03PS1) 10Effie Mouzeli: admin_ng: add mw-experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) [14:45:31] (03PS3) 10Bking: cirrussearch: add cirrussearch row B/remove elastic row C [puppet] - 10https://gerrit.wikimedia.org/r/1147779 (https://phabricator.wikimedia.org/T388610) [14:45:50] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1147779 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:48:44] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:49:05] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:54:15] (03PS3) 10Muehlenhoff: Deprecate AQS-related groups [puppet] - 10https://gerrit.wikimedia.org/r/1146978 
[14:54:24] !log dancy@deploy1003 Started scap sync-world: Updating images for T394389 [14:54:27] T394389: Migrate the additional dump types from snapshot1016 to Airflow - https://phabricator.wikimedia.org/T394389 [14:54:32] (03PS1) 10Elukey: maps: update grants-db-bookworm.sql [puppet] - 10https://gerrit.wikimedia.org/r/1147788 (https://phabricator.wikimedia.org/T381565) [14:55:53] brouberol@cumin2002 reimage (PID 2108301) is awaiting input [14:56:44] (03CR) 10Vgutierrez: [C:03+2] varnish: Prevent Vary: X-E-E from reaching users [puppet] - 10https://gerrit.wikimedia.org/r/1147780 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:58:28] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1147788 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [14:59:48] !log dancy@deploy1003 dancy: Updating images for T394389 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:59:53] T394389: Migrate the additional dump types from snapshot1016 to Airflow - https://phabricator.wikimedia.org/T394389 [15:00:15] (03CR) 10Muehlenhoff: [C:03+2] Deprecate AQS-related groups [puppet] - 10https://gerrit.wikimedia.org/r/1146978 (owner: 10Muehlenhoff) [15:02:43] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1144526 (https://phabricator.wikimedia.org/T382171) (owner: Ilias Sarantopoulos) [15:03:22] marostegui@cumin1002 clone (PID 3556919) is awaiting input [15:07:19] !log dancy@deploy1003 Finished scap sync-world: Updating images for T394389 (duration: 12m 55s) [15:07:24] T394389: Migrate the additional dump types from snapshot1016 to Airflow - https://phabricator.wikimedia.org/T394389 [15:08:29] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:11] (CR) Elukey: [C:+2] maps: update grants-db-bookworm.sql [puppet] - https://gerrit.wikimedia.org/r/1147788 (https://phabricator.wikimedia.org/T381565) (owner: Elukey) [15:10:43] (CR) Btullis: [C:+1] cirrussearch: add cirrussearch row B/remove elastic row C [puppet] - https://gerrit.wikimedia.org/r/1147779 (https://phabricator.wikimedia.org/T388610) (owner: Bking) [15:12:31] !log andrew@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1068 [15:12:46] !log andrew@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1068 [15:12:59] (CR) Btullis: airflow: do not package the tls-termination service for devenvs (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1147777 (https://phabricator.wikimedia.org/T393999) (owner: Brouberol) [15:13:38] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [15:14:34] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:15:05] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:16:03] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware 
upgrade firmware for hosts ['thanos-be2006'] [15:16:05] btullis: New images are ready for you to test. [15:16:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['thanos-be2006'] [15:16:37] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['thanos-be2006'] [15:16:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['thanos-be2006'] [15:16:48] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudvirt1068 to cloud-private vlan - andrew@cumin1002" [15:16:55] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudvirt1068 to cloud-private vlan - andrew@cumin1002" [15:16:55] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:17:06] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest2003'] [15:17:11] !log taavi@cumin1002 START - Cookbook sre.wikireplicas.update-views [15:17:24] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest2003'] [15:17:33] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest2005'] [15:17:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest2005'] [15:18:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host pc2018.codfw.wmnet with OS bookworm [15:18:09] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pc2018'] [15:18:11] ops-codfw, SRE, Data-Persistence, DC-Ops: 
Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10835231 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc2018.codfw.wmnet with OS bookworm executed with errors: - pc2018 (**FAIL**... [15:19:51] !log installing systemd bugfix updates from Bookworm point release [15:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:45] !log taavi@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [15:21:34] (CR) Dreamy Jazz: [C:+1] Update IPInfo access levels [mediawiki-config] - https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: Máté Szabó) [15:21:42] SRE, LDAP-Access-Requests: Grant Access to https://idm.wikimedia.org/ for maxbinderWMF - https://phabricator.wikimedia.org/T394523#10835258 (MBinder_WMF) Thanks for the help. So I reset via email, and still can't log in with mbinder. I double checked the password I created is what's entered. I also tried... 
[15:23:34] FIRING: KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1010.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:23:39] FIRING: [2x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:24:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['pc2018'] [15:27:38] (CR) Vgutierrez: [C:+2] varnish: Track original host header value for cookie scope purposes [puppet] - https://gerrit.wikimedia.org/r/1147728 (https://phabricator.wikimedia.org/T367346) (owner: Vgutierrez) [15:30:05] jan_drewniak: It is that lovely time of the day again! You are hereby commanded to deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T1530). 
[15:30:52] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [15:30:59] SRE, collaboration-services, Infrastructure-Foundations, vm-requests, Release-Engineering-Team (Radar): codfw: 1VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10835337 (Dzahn) a:Dzahn [15:31:14] !log andrew@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [15:31:25] !log andrew@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1069 [15:31:34] !log andrew@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1069 [15:31:39] !log andrew@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1070 [15:32:10] !log andrew@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudvirt1070 [15:32:13] !log andrew@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1070 [15:32:22] !log andrew@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1070 [15:32:41] !log andrew@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1071 [15:32:49] !log andrew@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1071 [15:32:51] !log andrew@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1072 [15:33:03] !log andrew@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1072 [15:33:07] !log andrew@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1073 [15:33:14] !log andrew@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1073 [15:33:17] !log andrew@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1074 [15:33:24] !log andrew@cumin1002 END 
(PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1074 [15:33:27] !log andrew@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1075 [15:33:34] !log andrew@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1075 [15:33:37] !log andrew@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1076 [15:33:45] !log andrew@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1076 [15:33:58] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [15:34:17] (PS1) Arturo Borrero Gonzalez: cloudgw: eqiad: introduce openstack octavia support [puppet] - https://gerrit.wikimedia.org/r/1147794 (https://phabricator.wikimedia.org/T394099) [15:34:20] !log dancy@deploy1003 Installing scap version "4.169.1" for 2 host(s) [15:35:12] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1147794 (https://phabricator.wikimedia.org/T394099) (owner: Arturo Borrero Gonzalez) [15:35:28] (PS1) Phuedx: ext-EventStreamConfig: Update product_metrics.web_base stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1147796 (https://phabricator.wikimedia.org/T394457) [15:35:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:36:08] !log dancy@deploy1003 Installation of scap version "4.169.1" completed for 2 hosts [15:37:13] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudvirt1069-1076 - andrew@cumin1002" [15:37:19] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera 
data: "Triggered by cookbooks.sre.dns.netbox: adding cloudvirt1069-1076 - andrew@cumin1002" [15:37:19] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:38:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:39:20] (CR) Bking: [C:+2] cirrussearch: add cirrussearch row B/remove elastic row C [puppet] - https://gerrit.wikimedia.org/r/1147779 (https://phabricator.wikimedia.org/T388610) (owner: Bking) [15:40:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:42:13] (PS1) Majavah: openstack: puppet: Always quote strings ENC YAML responses [puppet] - https://gerrit.wikimedia.org/r/1147801 (https://phabricator.wikimedia.org/T394691) [15:43:05] !log uploading lua5.3-maxminddb deb package to apt repo (currently unused) (T394504) [15:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:09] T394504: Deb package for github.com/fabled/lua-maxminddb - https://phabricator.wikimedia.org/T394504 [15:43:23] (PS25) Vgutierrez: varnish: Issue and handle WMF-Uniq cookie [puppet] - https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) [15:43:29] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:44:02] (CR) 
Andrew Bogott: [C:+1] openstack: puppet: Always quote strings ENC YAML responses [puppet] - https://gerrit.wikimedia.org/r/1147801 (https://phabricator.wikimedia.org/T394691) (owner: Majavah) [15:44:26] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1147796 (https://phabricator.wikimedia.org/T394457) (owner: Phuedx) [15:46:47] RESOLVED: [2x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:47:27] (PS1) Elukey: profile::prometheus: remove istio recording rules [puppet] - https://gerrit.wikimedia.org/r/1147803 (https://phabricator.wikimedia.org/T387350) [15:48:04] (CR) Majavah: [C:+2] openstack: puppet: Always quote strings ENC YAML responses [puppet] - https://gerrit.wikimedia.org/r/1147801 (https://phabricator.wikimedia.org/T394691) (owner: Majavah) [15:49:21] (CR) Muehlenhoff: [C:+2] Enable profile::auto_restarts::service for cortobot [puppet] - https://gerrit.wikimedia.org/r/1145808 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [15:52:24] FIRING: [2x] SystemdUnitFailed: kube-apiserver-safe-restart.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:05] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: 
name=cirrussearch1074.eqiad.wmnet|cirrussearch1075.eqiad.wmnet|cirrussearch1076.eqiad.wmnet|cirrussearch1077.eqiad.wmnet|cirrussearch1078.eqiad.wmnet|cirrussearch1079.eqiad.wmnet|cirrussearch1113.eqiad.wmnet|cirrussearch1114.eqiad.wmnet|cirrussearch1115.eqiad.wmnet|cirrussearch1116.eqiad.wmnet|cirrussearch1117.eqiad.wmnet [15:53:15] (CR) BCornwall: [C:+1] trafficserver: Add X-Experiment-Enrollments to Vary header [puppet] - https://gerrit.wikimedia.org/r/1147022 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez) [15:53:54] (CR) BBlack: [C:+1] trafficserver: Add X-Experiment-Enrollments to Vary header [puppet] - https://gerrit.wikimedia.org/r/1147022 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez) [15:55:28] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:55:31] (CR) JHathaway: [C:+1] sshd: Remove dead template argument [puppet] - https://gerrit.wikimedia.org/r/1146968 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff) [15:58:22] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_codfw - > [15:58:25] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_codfw - > [15:59:24] PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [16:00:16] RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 765, active_shards: 1802, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [16:00:16] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:03:29] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:03:48] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cirrussearch2110.codfw.wmnet with reason: firmware update [16:03:50] PROBLEM - ensure kvm processes are running on cloudvirt1075 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:03:55] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.05.02 - 2025.05.23): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10835568 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=383d5b3f-9716-4e3c-b3e5-5970c4f0c111) set by bking@cumin2002 for... [16:04:57] Puppet: validate systemd units - https://phabricator.wikimedia.org/T392629#10835571 (MoritzMuehlenhoff) >>! 
In T392629#10831269, @jhathaway wrote: > that is an interesting idea, we could perhaps cat the full unit and the override together, then run the verify command Or let's rather use "systemctl cat foo.s... [16:07:59] (CR) Herron: [C:+1] ircecho: exit upon disconnection [puppet] - https://gerrit.wikimedia.org/r/1147766 (https://phabricator.wikimedia.org/T389937) (owner: Tiziano Fogli) [16:08:08] PROBLEM - ensure kvm processes are running on cloudvirt1072 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:09:02] (CR) Herron: [C:+1] profile::prometheus: remove istio recording rules [puppet] - https://gerrit.wikimedia.org/r/1147803 (https://phabricator.wikimedia.org/T387350) (owner: Elukey) [16:10:40] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.05.02 - 2025.05.23): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10835605 (bking) @RobH looks like `cirrussearch2110.codfw.wmnet` will work for testing. I've banned, depooled and downtimed the host for... 
[16:12:01] (PS2) Dwisehaupt: spf wikimedia.org: add community-crm SPF record [dns] - https://gerrit.wikimedia.org/r/1147046 (https://phabricator.wikimedia.org/T383715) [16:12:24] PROBLEM - ensure kvm processes are running on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:12:24] FIRING: [2x] SystemdUnitFailed: kube-apiserver-safe-restart.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:12:30] (CR) Jgreen: [C:+1] spf wikimedia.org: add community-crm SPF record [dns] - https://gerrit.wikimedia.org/r/1147046 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt) [16:12:38] !log marostegui@cumin1002 START - Cookbook sre.mysql.pool db1179 gradually with 4 steps - Pool db1179.eqiad.wmnet in after cloning [16:13:25] (CR) Jgreen: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1145169 (https://phabricator.wikimedia.org/T388641) (owner: Filippo Giunchedi) [16:16:17] (PS5) Elukey: icinga: skip downtimed services in wait_for_optimal if needed [software/spicerack] - https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) [16:16:42] PROBLEM - ensure kvm processes are running on cloudvirt1076 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:17:46] (CR) Elukey: icinga: skip downtimed services in wait_for_optimal if needed (4 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: Elukey) [16:17:57] (CR) Vgutierrez: [C:+2] trafficserver: Add X-Experiment-Enrollments to Vary header [puppet] - 
https://gerrit.wikimedia.org/r/1147022 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez) [16:18:24] (PS2) BCornwall: cdn: Fix args reference [cookbooks] - https://gerrit.wikimedia.org/r/1146739 [16:21:00] PROBLEM - ensure kvm processes are running on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:22:24] RESOLVED: [2x] SystemdUnitFailed: kube-apiserver-safe-restart.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:24:24] PROBLEM - nova-compute proc minimum on cloudvirt1075 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:25:16] PROBLEM - ensure kvm processes are running on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:25:24] RECOVERY - nova-compute proc minimum on cloudvirt1075 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:29:34] PROBLEM - ensure kvm processes are running on cloudvirt1068 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:32:22] any objection to me running a quick backport via spiderpig? 
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Chart/+/1147059 [16:33:42] gonna put it in and hope it dont explode :D [16:33:58] (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/Chart] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1147059 (https://phabricator.wikimedia.org/T392725) (owner: Bvibber) [16:35:34] (CR) Vgutierrez: "you might wanna use self._reason.reason instead of the raw reason provided by the user" [cookbooks] - https://gerrit.wikimedia.org/r/1146739 (owner: BCornwall) [16:35:38] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [16:35:39] !log andrew@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [16:35:42] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [16:37:42] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10835810 (Jclark-ctr) [16:38:11] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10835815 (Jclark-ctr) a:Jclark-ctr [16:38:39] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10835831 (Jclark-ctr) [16:39:09] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10835832 (Jclark-ctr) Open→Resolved [16:39:15] (PS3) BCornwall: cdn: Fix "reason" variable reference [cookbooks] - https://gerrit.wikimedia.org/r/1146739 [16:39:24] (CR) BCornwall: "Good point, thanks!" 
[cookbooks] - https://gerrit.wikimedia.org/r/1146739 (owner: BCornwall) [16:39:45] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: correcting cloudvirt1072-1076 - andrew@cumin1002" [16:39:51] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: correcting cloudvirt1072-1076 - andrew@cumin1002" [16:39:51] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:41:06] (PS1) Andrew Bogott: Add cloudvirt id for cloudvirt1068-1071 [puppet] - https://gerrit.wikimedia.org/r/1147811 (https://phabricator.wikimedia.org/T394671) [16:41:51] (Merged) jenkins-bot: Render Data:.chart page reviews in user language [extensions/Chart] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1147059 (https://phabricator.wikimedia.org/T392725) (owner: Bvibber) [16:42:08] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1147059|Render Data:.chart page reviews in user language (T392725)]] [16:42:11] T392725: Data:.chart pages should preview showing the user language and ?uselang= - https://phabricator.wikimedia.org/T392725 [16:44:24] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [16:44:27] (CR) JMeybohm: [C:+1] calico: Set veth_mtu to 1480 for staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1145982 (https://phabricator.wikimedia.org/T352956) (owner: Alexandros Kosiaris) [16:45:22] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [16:45:37] (CR) JMeybohm: [C:+1] kartotherian: simplify the readinessProble's path [deployment-charts] - https://gerrit.wikimedia.org/r/1128432 (owner: Elukey) [16:45:50] (CR) CI reject: [V:-1] cdn: Fix "reason" 
variable reference [cookbooks] - https://gerrit.wikimedia.org/r/1146739 (owner: BCornwall) [16:45:59] (Abandoned) JMeybohm: CI test change - do not merge [deployment-charts] - https://gerrit.wikimedia.org/r/1146465 (owner: JMeybohm) [16:46:29] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1147059|Render Data:.chart page reviews in user language (T392725)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:47:04] looks good on test servers [16:47:07] where's my button :D [16:47:10] !log bvibber@deploy1003 bvibber: Continuing with sync [16:47:13] there it is [16:48:29] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:53:29] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:53:43] !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@d07b52d]: Deploy latest Airflow DAGs for the main instance. T392494. [16:53:46] T392494: Add data quality metrics to mediawiki_content_current_v1 - https://phabricator.wikimedia.org/T392494 [16:54:03] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1147059|Render Data:.chart page reviews in user language (T392725)]] (duration: 11m 54s) [16:54:07] T392725: Data:.chart pages should preview showing the user language and ?uselang= - https://phabricator.wikimedia.org/T392725 [16:54:19] !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@d07b52d]: Deploy latest Airflow DAGs for the main instance. T392494. (duration: 00m 36s) [16:54:38] two unresolvable hostnames? 
[16:54:41] that doesn't sound good [16:54:42] (CR) David Caro: [C:+1] openstack: puppet: Always quote strings ENC YAML responses [puppet] - https://gerrit.wikimedia.org/r/1147801 (https://phabricator.wikimedia.org/T394691) (owner: Majavah) [16:54:53] https://spiderpig.wikimedia.org/jobs/75 <-- [16:56:12] (CR) BCornwall: [C:+1] varnish: Issue and handle WMF-Uniq cookie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez) [16:56:56] (PS1) Dzahn: Revert "firewall/nftables_throttling: temp add rule to allow Istanbul Hackathon" [puppet] - https://gerrit.wikimedia.org/r/1147815 [16:57:25] RECOVERY - ensure kvm processes are running on cloudvirt1069 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:57:34] RECOVERY - ensure kvm processes are running on cloudvirt1068 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:58:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1179 gradually with 4 steps - Pool db1179.eqiad.wmnet in after cloning [16:58:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1179.eqiad.wmnet onto db1183.eqiad.wmnet [16:58:16] RECOVERY - ensure kvm processes are running on cloudvirt1071 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:58:43] (CR) BCornwall: "recheck" [cookbooks] - https://gerrit.wikimedia.org/r/1146739 (owner: BCornwall) [16:59:51] (PS26) Vgutierrez: varnish: Issue and handle WMF-Uniq cookie [puppet] - https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) [17:00:05] swfrench-wmf: Your horoscope predicts another MediaWiki infrastructure (UTC late) deploy. 
May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T1700). [17:00:05] ryankemper: Time to do the Wikidata Query Service weekly deploy deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T1700). [17:00:35] o/ [17:00:52] (CR) Scott French: [C:+2] P:mw:maint:update_special_pages: remove absented non-sharded job [puppet] - https://gerrit.wikimedia.org/r/1146787 (https://phabricator.wikimedia.org/T388534) (owner: Scott French) [17:01:02] (PS1) Dzahn: remove throttling config for Istanbul Hackathon [mediawiki-config] - https://gerrit.wikimedia.org/r/1147816 (https://phabricator.wikimedia.org/T382309) [17:02:25] (PS2) Dzahn: remove throttling config for Istanbul Hackathon [mediawiki-config] - https://gerrit.wikimedia.org/r/1147816 (https://phabricator.wikimedia.org/T382309) [17:02:32] (CR) Vgutierrez: varnish: Issue and handle WMF-Uniq cookie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez) [17:02:33] (PS2) Dzahn: Revert "firewall/nftables_throttling: temp add rule to allow Istanbul Hackathon" [puppet] - https://gerrit.wikimedia.org/r/1147815 (https://phabricator.wikimedia.org/T382309) [17:04:22] (CR) Dwisehaupt: [C:+2] spf wikimedia.org: add community-crm SPF record [dns] - https://gerrit.wikimedia.org/r/1147046 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt) [17:04:37] !log dwisehaupt@dns1004 START - running authdns-update [17:05:21] !log dwisehaupt@dns1004 END - running authdns-update [17:05:25] (CR) Dzahn: [C:+2] Revert "firewall/nftables_throttling: temp add rule to allow Istanbul Hackathon" [puppet] - https://gerrit.wikimedia.org/r/1147815 (https://phabricator.wikimedia.org/T382309) (owner: Dzahn) [17:06:10] PROBLEM - ensure kvm processes are running on cloudvirt1072 is CRITICAL: PROCS 
CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:07:47] (03CR) 10Scott French: [C:03+2] P:mw:maint:update_special_pages: updateSpecialPages in s6 to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1146788 (https://phabricator.wikimedia.org/T388534) (owner: 10Scott French) [17:08:26] 07Puppet: validate systemd units - https://phabricator.wikimedia.org/T392629#10835971 (10jhathaway) >>! In T392629#10835571, @MoritzMuehlenhoff wrote: >>>! In T392629#10831269, @jhathaway wrote: >> that is an interesting idea, we could perhaps cat the full unit and the override together, then run the verify comm... [17:09:01] (03PS2) 10Scott French: P:mw:maint:update_special_pages: updateSpecialPages in s6 to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1146788 (https://phabricator.wikimedia.org/T388534) [17:13:07] (03CR) 10Scott French: [C:03+2] P:mw:maint:update_special_pages: updateSpecialPages in s6 to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1146788 (https://phabricator.wikimedia.org/T388534) (owner: 10Scott French) [17:13:22] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [17:16:13] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1058 to cirrussearch1058 [17:16:14] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:16:26] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:19:42] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1058 to cirrussearch1058 - bking@cumin2002" [17:19:47] jouncebot: nowandnext [17:19:47] For the next 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T1700) [17:19:47] For the next 0 hour(s) and 10 minute(s): Wikidata Query Service weekly deploy 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T1700) [17:19:47] In 2 hour(s) and 40 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T2000) [17:20:52] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1058 to cirrussearch1058 - bking@cumin2002" [17:20:52] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:20:53] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1058 on all recursors [17:20:56] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1058 on all recursors [17:20:57] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1058 [17:21:11] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [17:21:18] is mediawiki deployment happening now? [17:21:22] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [17:21:56] (03CR) 10Btullis: [C:03+2] airflow-analytics-test: Raise scheduler limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147760 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [17:22:11] mutante: I'm wrapping up some helmfile-only changes [17:22:26] swfrench-wmf: how about mediawiki-config deploy? 
[17:22:29] I have https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1147816 [17:22:37] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1058 [17:22:38] reverting a throttle rule [17:23:17] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1058 to cirrussearch1058 [17:23:27] (03Merged) 10jenkins-bot: airflow-analytics-test: Raise scheduler limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147760 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [17:23:56] mutante: I think I'm done (changes applied cleanly), so you're good to go [17:24:48] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1058.eqiad.wmnet with OS bullseye [17:24:53] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1058 [17:24:53] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1058 [17:25:38] swfrench-wmf: thanks! ack [17:26:33] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1147766 (https://phabricator.wikimedia.org/T389937) (owner: 10Tiziano Fogli) [17:27:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dzahn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147816 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [17:28:00] (03Merged) 10jenkins-bot: remove throttling config for Istanbul Hackathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147816 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [17:28:14] !log dzahn@deploy1003 Started scap sync-world: Backport for [[gerrit:1147816|remove throttling config for Istanbul Hackathon (T382309)]] [17:28:23] spiderpig, spiderpig.. 
does whatever a spiderpig does [17:32:00] PROBLEM - ensure kvm processes are running on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:32:10] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on backup1008:9290 - https://phabricator.wikimedia.org/T394674#10836162 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Reseated cable [17:32:18] !log dzahn@deploy1003 dzahn: Backport for [[gerrit:1147816|remove throttling config for Istanbul Hackathon (T382309)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:32:47] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on backup1008 - https://phabricator.wikimedia.org/T394673#10836169 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Reseated cable [17:32:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10836173 (10Dzahn) ` 17:32 <+icinga-wm> PROBLEM - ensure kvm processes are running on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex ar... 
[17:32:57] bking@cumin2002 reimage (PID 2199250) is awaiting input [17:33:10] !log dzahn@deploy1003 dzahn: Continuing with sync [17:33:43] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10836174 (10VRiley-WMF) a:03VRiley-WMF [17:36:36] !log brouberol@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1011.eqiad.wmnet with OS bookworm [17:36:45] PROBLEM - ensure kvm processes are running on cloudvirt1076 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:36:50] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10836212 (10Marostegui) @VRiley-WMF I'd need to depool the server first. [17:37:24] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10836216 (10VRiley-WMF) Hey @Marostegui we certainly do! Is there a preferred time or date to swap out the memory? Let us know, thanks! [17:38:29] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10836223 (10Marostegui) @VRiley-WMF thanks! I'll have the host ready for you tomorrow if that's ok? [17:39:29] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10836240 (10VRiley-WMF) @Marostegui That works for me. I'll plan for it then. Thanks! 
[17:39:49] !log dzahn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1147816|remove throttling config for Istanbul Hackathon (T382309)]] (duration: 11m 34s) [17:40:39] !log brouberol@cumin2002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1011.eqiad.wmnet with OS bookworm [17:47:01] (03CR) 10Scott French: [C:03+1] mw:maintenance: Fix newlines in kubernetes periodic job [puppet] - 10https://gerrit.wikimedia.org/r/1147754 (owner: 10Clément Goubert) [17:48:29] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:49:13] brouberol: hi. it looks like 2 snapshot hosts were renamed. but they are not completely removed. This leads to mw deployment scap failure.. "cant lookup in DNS" [17:49:27] snapshot1014 and 1017 [17:50:19] ah, I'm working on 1014 and snapshot1017 should have been re-imaged and renamed. What do you mean by "not completely removed"? Removed from some scap config? [17:51:07] Oh, I think I know. I'll send a patch right away [17:52:28] brouberol: scap tries to look them up in DNS.. that makes me think they are in "dsh groups".. so that should mean hieradata now [17:52:53] like the hiera config that replaced the old dsh groups and defines which hosts are "mw hosts" [17:53:00] and thanks, ack [17:53:10] PROBLEM - ensure kvm processes are running on cloudvirt1072 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:53:18] (03PS1) 10Brouberol: remove decommissioned snapshot10{14,17} from hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/1147826 (https://phabricator.wikimedia.org/T394647) [17:53:37] mutante https://gerrit.wikimedia.org/r/c/operations/puppet/+/1147826 [17:53:38] re: icinga alerts.
those hosts were apparently reimaged while keeping the puppet prod role on them.. leading to monitoring noise [17:53:49] left a comment on the ticket for dcops [17:54:05] (03CR) 10Dzahn: [C:03+1] remove decommissioned snapshot10{14,17} from hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/1147826 (https://phabricator.wikimedia.org/T394647) (owner: 10Brouberol) [17:54:20] brouberol: yes:) that is what I had in mind. +1 [17:54:31] (03CR) 10Scott French: [C:03+1] mw::maintenance: Add ttlsecondsafterfinished to long interval jobs [puppet] - 10https://gerrit.wikimedia.org/r/1147778 (https://phabricator.wikimedia.org/T394423) (owner: 10Clément Goubert) [17:55:01] !log brouberol@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1011.eqiad.wmnet with reason: host reimage [17:55:03] (03CR) 10Brouberol: [C:03+2] remove decommissioned snapshot10{14,17} from hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/1147826 (https://phabricator.wikimedia.org/T394647) (owner: 10Brouberol) [17:55:25] puppet-merge-d [17:55:41] thanks [17:55:49] sorry about that [17:56:19] I'll run puppet on the deployment host right after [17:57:21] no worries. thx [17:58:06] One thing I'm not sure I understand is " those hosts were apparently reimaged while keeping the puppet prod role on them". Wasn't changing their role in site.pp + preseed in preseed.yaml enough? [17:58:54] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1011.eqiad.wmnet with reason: host reimage [17:59:42] brouberol: that comment was about OTHER alerts.. the cloudvirt hosts.. too much alerting noise [17:59:52] PROBLEM - ensure kvm processes are running on cloudvirt1075 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:59:55] oh sorry [17:59:57] ^this [17:59:57] gotcha [18:00:32] there are 3 of them.
it's an old problem that things get reimaged with the puppet role and no downtimes..and then once the role gets applied.. BOOM.. alert [18:01:51] understood. [18:02:02] puppet has run on the deployment server, you should be gtg [18:03:07] alright! you made the next deployers happy. I think I dont have to do anything and can just leave it. the error was just "cant deploy to these 2 hosts" but it did the rest [18:03:37] just looked bad in the UI with the failures [18:05:34] (03CR) 10Brouberol: airflow: do not package the tls-termination service for devenvs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147777 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [18:06:00] RECOVERY - ensure kvm processes are running on cloudvirt1073 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:07:52] RECOVERY - ensure kvm processes are running on cloudvirt1075 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:08:10] RECOVERY - ensure kvm processes are running on cloudvirt1072 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:08:44] RECOVERY - ensure kvm processes are running on cloudvirt1076 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:09:08] nice, there we go [18:12:29] (03PS1) 10Brouberol: Upgrade airflow-devenv [puppet] - 10https://gerrit.wikimedia.org/r/1147831 (https://phabricator.wikimedia.org/T393999) [18:13:09] (03CR) 10Dzahn: "I would like to add to this we also reduced the size of gerrit backups quite a bit recently.. by deleting old data we don't need anymore. 
" [puppet] - 10https://gerrit.wikimedia.org/r/1140507 (https://phabricator.wikimedia.org/T393034) (owner: 10Dzahn) [18:13:21] (03CR) 10Dzahn: [C:03+2] gerrit: enable backups on gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1140507 (https://phabricator.wikimedia.org/T393034) (owner: 10Dzahn) [18:15:07] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1011.eqiad.wmnet with OS bookworm [18:15:34] (03CR) 10Brouberol: [C:03+2] Upgrade airflow-devenv [puppet] - 10https://gerrit.wikimedia.org/r/1147831 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [18:18:29] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [18:24:48] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1072.eqiad.wmnet with OS bookworm [18:27:07] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1076.eqiad.wmnet with OS bookworm [18:27:09] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1075.eqiad.wmnet with OS bookworm [18:27:12] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1073.eqiad.wmnet with OS bookworm [18:29:51] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 10Release-Engineering-Team (Radar): codfw: 1VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10836565 (10Dzahn) I will do this. But for both data centers. Not just one. We have said we do not want to create sin... 
[18:31:05] PROBLEM - statsv Varnishkafka log producer on cp6013 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [18:31:10] (03PS1) 10Jgleeson: Make BundleSizeTest cross-compatible with <=1.44 and >=1.45 [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1147839 (https://phabricator.wikimedia.org/T394542) [18:31:47] (03CR) 10Andrew Bogott: [C:03+2] Add cloudvirt id for cloudvirt1068-1071 [puppet] - 10https://gerrit.wikimedia.org/r/1147811 (https://phabricator.wikimedia.org/T394671) (owner: 10Andrew Bogott) [18:32:05] RECOVERY - statsv Varnishkafka log producer on cp6013 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [18:33:05] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [18:33:26] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [18:33:53] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [18:34:37] (03PS1) 10Andrew Bogott: Add cloudvirt id for cloudvirt1072-1076 [puppet] - 10https://gerrit.wikimedia.org/r/1147844 (https://phabricator.wikimedia.org/T394671) [18:34:51] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [18:35:04] (03CR) 10Jgleeson: [C:04-1] "Looks like we needed more than a namespace update here. 
I tracked down this ticket https://phabricator.wikimedia.org/T393122, which lined " [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1147060 (https://phabricator.wikimedia.org/T373017) (owner: 10Reedy) [18:36:23] (03CR) 10Andrew Bogott: [C:03+2] Add cloudvirt id for cloudvirt1072-1076 [puppet] - 10https://gerrit.wikimedia.org/r/1147844 (https://phabricator.wikimedia.org/T394671) (owner: 10Andrew Bogott) [18:43:01] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1076.eqiad.wmnet with reason: host reimage [18:43:29] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [18:43:34] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1073.eqiad.wmnet with reason: host reimage [18:43:56] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1075.eqiad.wmnet with reason: host reimage [18:45:44] andrew@cumin1002 reimage (PID 3597982) is awaiting input [18:46:20] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1076.eqiad.wmnet with reason: host reimage [18:46:22] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host zuul2001.codfw.wmnet [18:46:24] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [18:48:29] FIRING: [4x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:50:02] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1073.eqiad.wmnet with reason: host reimage [18:52:02] dzahn@cumin1002 makevm (PID 3601862) is 
awaiting input [18:53:29] FIRING: [5x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:54:04] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1075.eqiad.wmnet with reason: host reimage [18:54:32] (03CR) 10Aleksandar Mastilovic: Removing WM Enterprise downloader Puppet configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [18:56:09] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_codfw - > [18:58:04] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM zuul2001.codfw.wmnet - dzahn@cumin1002" [18:58:11] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM zuul2001.codfw.wmnet - dzahn@cumin1002" [18:58:11] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:58:11] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache zuul2001.codfw.wmnet on all recursors [18:58:14] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) zuul2001.codfw.wmnet on all recursors [18:58:43] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM zuul2001.codfw.wmnet - dzahn@cumin1002" [18:58:49] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM 
zuul2001.codfw.wmnet - dzahn@cumin1002" [19:00:23] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_codfw - > [19:00:26] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1072.eqiad.wmnet with OS bookworm [19:01:23] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_eqiad - > [19:01:24] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_eqiad - > [19:01:49] dzahn@cumin1002 makevm (PID 3601862) is awaiting input [19:02:27] (03CR) 10BCornwall: [C:03+1] varnish: Issue and handle WMF-Uniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [19:05:22] dzahn@cumin1002 makevm (PID 3601862) is awaiting input [19:08:29] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1072.eqiad.wmnet with OS bookworm [19:08:48] (03PS1) 10Dzahn: site: add zuul VMs with collab-insetup-role [puppet] - 10https://gerrit.wikimedia.org/r/1147855 (https://phabricator.wikimedia.org/T393873) [19:10:10] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10836706 (10wiki_willy) Just a quick update: our Dell Account team is working on a resolution. There's a new open case for requesting a RMA and a server replacement. 
[19:11:30] (03CR) 10BBlack: varnish: Issue and handle WMF-Uniq cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [19:12:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:15:13] jhancock@cumin2002 provision (PID 2249775) is awaiting input [19:15:58] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1076.eqiad.wmnet with OS bookworm [19:16:15] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1073.eqiad.wmnet with OS bookworm [19:18:42] !log Ran fixStuckGlobalRename.php for T394699 [19:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:46] T394699: Unblock stuck global rename of Sklcq - https://phabricator.wikimedia.org/T394699 [19:19:26] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1075.eqiad.wmnet with OS bookworm [19:23:29] FIRING: KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1010.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:26:02] FIRING: SLOMetricAbsent: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:26:10] FIRING: SLOMetricAbsent: etcd-latency codfw - https://slo.wikimedia.org/?search=etcd-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:26:39] FIRING: WebrequestSampledDown: Benthos metrics for webrequest_sampled are not reported from eqiad and codfw - https://wikitech.wikimedia.org/wiki/Benthos#Benthos_on_centrallog - 
https://grafana.wikimedia.org/d/V0TSK7O4z/benthos?var-port=4151 - https://alerts.wikimedia.org/?q=alertname%3DWebrequestSampledDown [19:26:44] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [19:26:48] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [19:28:11] (03CR) 10Dzahn: [C:03+2] site: add zuul VMs with collab-insetup-role [puppet] - 10https://gerrit.wikimedia.org/r/1147855 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [19:28:29] FIRING: [5x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:28:29] FIRING: [9x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:29:05] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1072.eqiad.wmnet with OS bookworm [19:31:02] RESOLVED: [6x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:31:18] RESOLVED: [5x] SLOMetricAbsent: etcd-latency codfw - https://slo.wikimedia.org/?search=etcd-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:31:39] RESOLVED: WebrequestSampledDown: Benthos metrics for webrequest_sampled are not reported from eqiad and codfw - 
https://wikitech.wikimedia.org/wiki/Benthos#Benthos_on_centrallog - https://grafana.wikimedia.org/d/V0TSK7O4z/benthos?var-port=4151 - https://alerts.wikimedia.org/?q=alertname%3DWebrequestSampledDown [19:31:44] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [19:31:48] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [19:33:56] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host zuul2001.codfw.wmnet with OS bookworm [19:34:05] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 10Release-Engineering-Team (Radar): codfw: 1VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10836773 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host zuul2001... 
[19:41:05] (03CR) 10Vgutierrez: [C:03+1] hiera: disable vk on A:cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1147783 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [19:43:41] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1072.eqiad.wmnet with OS bookworm [19:44:16] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:45:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:48:15] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (gerrit2003), No backups: 1 (gerrit2003), Fresh: 143 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [19:59:10] ^ I just merged a change to add that host to backups earlier.. so it's not a surprise to me there isn't one yet. I guess it's a known minor issue it alerts on that. [19:59:25] PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T2000). [20:00:05] danisztls and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:17] RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 765, active_shards: 1802, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [20:00:17] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:02:03] o/ [20:02:19] jouncebot: now [20:02:19] For the next 0 hour(s) and 57 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T2000) [20:03:15] danisztls: are you going to use spiderpig? [20:03:29] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:29] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:03:49] o/ [20:04:46] mutante: no, not sure [20:05:26] mutante: I'm not familiar to ir [20:05:28] *it [20:05:49] danisztls: it's the new web UI to deploy.. basically it's clicking buttons now [20:06:16] mutante: cool, do I have perms for that? [20:06:50] danisztls: I am not sure. let's find out. if you look at the link above (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T2000) see that new "Deploy change" link there? 
[20:07:03] it links to https://spiderpig.wikimedia.org/?backport=1147763 [20:07:24] wonder if you can login to that [20:07:44] andrew@cumin1002 reimage (PID 3614084) is awaiting input [20:07:45] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1085.eqiad.wmnet with OS bullseye [20:07:49] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1085 [20:07:50] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1085 [20:08:28] mutante: missing privileges [20:09:16] danisztls: have you ever used the deployment server via ssh login? [20:09:43] mutante: yes, but for miscweb only [20:09:54] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1086.eqiad.wmnet with OS bullseye [20:09:58] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1086 [20:09:58] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1086 [20:10:33] danisztls: try this (if you want, nothing bad will happen): ssh deploy1003.eqiad.wmnet scap spiderpig-otp [20:10:43] it should give you a one-time code [20:10:51] which then lets you log into that web UI [20:12:00] jhancock@cumin2002 provision (PID 2264070) is awaiting input [20:13:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:14:59] mutante: I'm able to generate the OTP code but neither the OTP code nor wikitech password works for logging in [20:15:07] mutante: "Service access denied due to missing privileges." [20:15:33] (03CR) 10Btullis: [C:03+1] airflow: do not package the tls-termination service for devenvs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147777 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [20:17:17] danisztls: I see. ok..
I am not sure exactly where that is configured. that's a question for releng. I just wanted to share the new thing. [20:17:22] mutante: I'm not on https://ldap.toolforge.org/group/spiderpig-access [20:17:32] aha, I see [20:17:39] mutante: thanks, it's cool that we have this option [20:17:58] (03PS1) 10Bking: WIP: cirrussearch: Add newly-reimaged hosts back to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1147871 (https://phabricator.wikimedia.org/T388610) [20:18:10] danisztls: do you want me to just do it then? [20:20:19] mutante: yes, please [20:20:25] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1058.eqiad.wmnet with OS bullseye [20:20:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dzahn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147763 (https://phabricator.wikimedia.org/T394315) (owner: 10DDesouza) [20:20:58] danisztls: doing!:) also see PM [20:21:39] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1059 to cirrussearch1059 [20:21:52] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:22:27] (03Merged) 10jenkins-bot: Design Research survey: Increase coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147763 (https://phabricator.wikimedia.org/T394315) (owner: 10DDesouza) [20:22:44] !log dzahn@deploy1003 Started scap sync-world: Backport for [[gerrit:1147763|Design Research survey: Increase coverage (T394315)]] [20:22:47] T394315: ES.wiki QuickSurvey request for DR participant recruitment - https://phabricator.wikimedia.org/T394315 [20:23:19] !log Downgrade rsyslog, rsyslog-kafka, and rsyslog-openssl to `8.2302.0-1+deb12u1_amd64` - T383309 [20:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:23] T383309: rsyslog receiver on centrallog hosts misplaces some log host entries - https://phabricator.wikimedia.org/T383309 [20:24:21] !log dzahn@cumin1002 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host zuul2001.codfw.wmnet with OS bookworm [20:24:21] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host zuul2001.codfw.wmnet [20:24:29] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 10Release-Engineering-Team (Radar): codfw: 1VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10837026 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host zuul2001.cod... [20:25:01] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1086.eqiad.wmnet with reason: host reimage [20:25:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [20:25:24] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10837033 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2005.codfw.wmnet with OS bookworm [20:25:26] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1059 to cirrussearch1059 - bking@cumin2002" [20:25:47] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1059 to cirrussearch1059 - bking@cumin2002" [20:25:48] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:25:48] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1059 on all recursors [20:25:51] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1059 on all recursors [20:25:52] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1059 [20:26:59] !log dzahn@deploy1003 dani, dzahn: Backport for 
[[gerrit:1147763|Design Research survey: Increase coverage (T394315)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:27:01] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1059 [20:27:37] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1085.eqiad.wmnet with reason: host reimage [20:27:41] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1059 to cirrussearch1059 [20:28:41] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1086.eqiad.wmnet with reason: host reimage [20:29:23] !log dzahn@deploy1003 dani, dzahn: Continuing with sync [20:31:53] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1085.eqiad.wmnet with reason: host reimage [20:32:10] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1059.eqiad.wmnet with OS bullseye [20:32:14] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1059 [20:32:14] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1059 [20:36:18] !log dzahn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1147763|Design Research survey: Increase coverage (T394315)]] (duration: 13m 33s) [20:36:21] T394315: ES.wiki QuickSurvey request for DR participant recruitment - https://phabricator.wikimedia.org/T394315 [20:38:18] tgr: in case you were waiting..the other patch is done now! [20:39:58] bking@cumin2002 reimage (PID 2286898) is awaiting input [20:40:27] (03CR) 10Dzahn: [C:03+2] "got an alert about backup freshness but I am 99% sure this is just a known minor issue when a new host is added. like.. 
no surprise there " [puppet] - 10https://gerrit.wikimedia.org/r/1140507 (https://phabricator.wikimedia.org/T393034) (owner: 10Dzahn) [20:41:44] thx mutante [20:43:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146013 (https://phabricator.wikimedia.org/T391270) (owner: 10Gergő Tisza) [20:44:07] (03CR) 10BPirkle: [C:03+1] rest-gateway: route reading lists API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143127 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan) [20:44:13] (03Merged) 10jenkins-bot: [noop] Set $wgCentralAuthRestrictSharedDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146013 (https://phabricator.wikimedia.org/T391270) (owner: 10Gergő Tisza) [20:44:15] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1059.eqiad.wmnet with OS bullseye [20:44:24] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1146013|[noop] Set $wgCentralAuthRestrictSharedDomain (T391270)]] [20:44:28] T391270: Determine CentralAuth SUL3 defaults - https://phabricator.wikimedia.org/T391270 [20:47:54] (03PS1) 10Dzahn: installserver: add partman stanza for zuul* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1147878 (https://phabricator.wikimedia.org/T393873) [20:48:19] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1080 to cirrussearch1080 [20:48:33] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:48:33] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:48:39] (03CR) 10Dzahn: [C:03+2] installserver: add partman stanza for zuul* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1147878 
(https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [20:49:14] !log tgr@deploy1003 tgr: Backport for [[gerrit:1146013|[noop] Set $wgCentralAuthRestrictSharedDomain (T391270)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:50:59] !log tgr@deploy1003 tgr: Continuing with sync [20:54:14] bking@cumin2002 rename (PID 2296236) is awaiting input [20:54:59] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10837125 (10VRiley-WMF) There has been a new ticket opened for this unit for RMA. It is Service Request 210136653 [20:55:07] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1086.eqiad.wmnet with OS bullseye [20:55:54] Hey all - was going to sec-deploy 4 patches here in a bit, but I see logstash error rates are kind of high due to that commons $_SESSION error. Should I hold off? [20:56:08] RECOVERY - Restbase root url on restbase1043 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/RESTBase [20:56:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1085.eqiad.wmnet with OS bullseye [20:56:50] sbassett: I think tgr is still deploying and it's a question for him [20:57:49] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146013|[noop] Set $wgCentralAuthRestrictSharedDomain (T391270)]] (duration: 13m 24s) [20:57:53] T391270: Determine CentralAuth SUL3 defaults - https://phabricator.wikimedia.org/T391270 [20:58:34] !log late UTC deploys done [20:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:45] stashbot: ^ [20:58:46] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. 
[20:58:50] sbassett: ^ [20:59:13] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10837151 (Jclark-ctr) a:Jclark-ctr [20:59:25] sorry, didn't read the question [20:59:33] yeah it's safe to ignore that error [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T2100). [21:00:20] it's a deprecation notice and I have trouble figuring out exactly where it is coming from, but it doesn't affect anything other than causing noise [21:01:18] (the relevant task is T393963 FWIW) [21:01:19] T393963: PHP Deprecated: Use of $_SESSION was deprecated in MediaWiki 1.27. [Called from session_write_close in (internal function)] - https://phabricator.wikimedia.org/T393963 [21:01:49] (PS1) Andrew Bogott: Prepare cloudvirt103[1-9] for decom [puppet] - https://gerrit.wikimedia.org/r/1147880 (https://phabricator.wikimedia.org/T394727) [21:02:08] tgr: ok, thanks! 
[21:02:10] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1080 to cirrussearch1080 - bking@cumin2002" [21:02:53] (03CR) 10CI reject: [V:04-1] Prepare cloudvirt103[1-9] for decom [puppet] - 10https://gerrit.wikimedia.org/r/1147880 (https://phabricator.wikimedia.org/T394727) (owner: 10Andrew Bogott) [21:03:13] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1080 to cirrussearch1080 - bking@cumin2002" [21:03:14] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:03:14] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1080 on all recursors [21:03:17] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1080 on all recursors [21:03:18] (03PS1) 10Scardenasmolinar: Add AutoModerator to eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147881 [21:03:18] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1080 [21:04:10] (03PS2) 10Scardenasmolinar: Add AutoModerator to eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147881 (https://phabricator.wikimedia.org/T391248) [21:04:37] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1080 [21:05:17] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1080 to cirrussearch1080 [21:06:00] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1072.eqiad.wmnet with OS bookworm [21:06:32] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1072.eqiad.wmnet with OS bookworm [21:07:13] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1080.eqiad.wmnet with OS bullseye 
[21:07:17] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1080 [21:07:18] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1080 [21:07:26] (03CR) 10Greg Grossmeier: "recheck" [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1147043 (owner: 10Greg Grossmeier) [21:10:53] (03PS1) 10Bking: WIP: cirrussearch: Add newly-reimaged hosts back to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1147871 (https://phabricator.wikimedia.org/T388610) [21:11:07] jhancock@cumin2002 reimage (PID 2281766) is awaiting input [21:11:52] (03PS2) 10Bking: cirrussearch: Add newly-reimaged hosts back to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1147871 (https://phabricator.wikimedia.org/T388610) [21:12:40] (03PS1) 10Andrew Bogott: Remove mention of cloudvirt103[1-9].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1147883 (https://phabricator.wikimedia.org/T394727) [21:16:42] PROBLEM - ensure kvm processes are running on cloudvirt1076 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:16:44] PROBLEM - ensure kvm processes are running on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:17:14] PROBLEM - ensure kvm processes are running on cloudvirt1075 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:17:44] RECOVERY - ensure kvm processes are running on cloudvirt1073 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:18:14] RECOVERY - ensure kvm processes are running on cloudvirt1075 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:18:40] (CR) Bking: [C:+2] cirrussearch: Add newly-reimaged hosts back to conftool [puppet] - https://gerrit.wikimedia.org/r/1147871 (https://phabricator.wikimedia.org/T388610) (owner: Bking) [21:18:42] RECOVERY - ensure kvm processes are running on cloudvirt1076 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:20:54] !log Deployed security fixes for T394692, T394693 and T394700 [21:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:36] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1280.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1268.eqiad.wmnet, wikikube-worker1148.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1103.eqiad.wmnet, wikikube-worker1108.eqiad.wmnet, wikikube-worker1101.eqiad. 
[21:21:36] ikikube-worker1121.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, wikikube-worker1050.eqiad.wmnet, wikikube-worker1049.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, wikikube-worker1132.eqiad.wmnet, wikikube-worker1016.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1136.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1161.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, wikikube-worke [21:21:36] iad.wmnet, wikikube-worker1313.eqiad.wmnet, wikikube-worker1056.eqiad.wmnet, wikikube-worker1015.eqiad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube-worker1278.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal [21:21:46] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1081 to cirrussearch1081 [21:21:57] FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:21:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1144.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1268.eqiad.wmnet, wikikube-worker1298.eqiad.wmnet, wikikube-worker1103.eqiad.wmnet, wikikube-worker1101.eqiad.wmnet, wikikube-worker1121.eqiad.wmnet, wikikube-worker1116.eqiad. 
[21:21:58] ikikube-worker1281.eqiad.wmnet, wikikube-worker1007.eqiad.wmnet, wikikube-worker1320.eqiad.wmnet, wikikube-worker1252.eqiad.wmnet, wikikube-worker1315.eqiad.wmnet, wikikube-worker1094.eqiad.wmnet, wikikube-worker1132.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1136.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1157.eqiad.wmnet, wikikube-worker1072.eqiad.wmnet, wikikube-worke [21:21:58] iad.wmnet, wikikube-worker1270.eqiad.wmnet, wikikube-worker1244.eqiad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube-worker1037.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal [21:21:59] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:22:26] ^^ well that doesn't look good [21:22:32] o/ [21:22:37] indeed no [21:23:36] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:23:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:25:07] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1081 to cirrussearch1081 - bking@cumin2002" [21:25:37] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1081 to cirrussearch1081 - bking@cumin2002" [21:25:37] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:25:37] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1081 on all recursors [21:25:41] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1081 on all recursors [21:25:42] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1081 [21:26:41] FIRING: [4x] 
ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:26:50] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1081 [21:26:57] RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:27:30] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1081 to cirrussearch1081 [21:27:51] jhathaway: the shellbox-video alert was probably due to a couple of back-to-back backports temporarily causing mw-videoscaler jobs to run more transcodes than we have capacity in shellbox-video [21:27:59] checking now [21:28:14] thanks swfrench-wmf [21:30:35] andrew@cumin1002 reimage (PID 3624656) is awaiting input [21:31:44] jhathaway: yes, that appears to be what happened. 
I need to step away for a 20-30m, but if this returns, I can add a bit of capacity on the shellbox-video side to absorb the "extra" throughput [21:31:57] (i.e., if the older jobs don't drain off quickly enough) [21:32:06] sounds good, thanks again swfrench-wmf [21:32:31] (03PS1) 10Btullis: Add dse-k8s-worker10[10-11] to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1147887 (https://phabricator.wikimedia.org/T394647) [21:34:01] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5599/co" [puppet] - 10https://gerrit.wikimedia.org/r/1147887 (https://phabricator.wikimedia.org/T394647) (owner: 10Btullis) [21:35:37] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1072.eqiad.wmnet with OS bookworm [21:36:08] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1072.eqiad.wmnet with OS bookworm [21:37:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1051.eqiad.wmnet, wikikube-worker1144.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1067.eqiad.wmnet, wikikube-worker1148.eqiad.wmnet, wikikube-worker1259.eqiad. 
[21:37:58] ikikube-worker1103.eqiad.wmnet, wikikube-worker1108.eqiad.wmnet, wikikube-worker1101.eqiad.wmnet, wikikube-worker1121.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, wikikube-worker1281.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1252.eqiad.wmnet, wikikube-worker1268.eqiad.wmnet, wikikube-worker1094.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1136.eqiad.wmnet, wikikube-worke [21:37:58] iad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, wikikube-worker1307.eqiad.wmnet, wikikube-worker1056.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal [21:38:29] FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:39:27] FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:39:36] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1280.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1051.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1101.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, wikikube-worker1050.eqiad. 
[21:39:36] ikikube-worker1007.eqiad.wmnet, wikikube-worker1049.eqiad.wmnet, wikikube-worker1315.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, wikikube-worker1094.eqiad.wmnet, wikikube-worker1132.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1260.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, wikikube-worker1149.eqiad.wmnet, wikikube-worker1069.eqiad.wmnet, wikikube-worker1313.eqiad.wmnet, wikikube-worke [21:39:36] iad.wmnet, wikikube-worker1015.eqiad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube-worker1037.eqiad.wmnet, wikikube-worker1278.eqiad.wmnet, wikikube-worker1106.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal [21:39:44] o/ back earlier than expected, and indeed it seems this is still happening =/ [21:39:56] nod [21:40:08] what graphs are you looking at? [21:40:37] https://grafana.wikimedia.org/goto/ChLWg8-Hg?orgId=1 and https://grafana.wikimedia.org/goto/8rXZgU-HR?orgId=1 [21:40:41] thanks [21:40:52] the second one is the best correlate for when this will alert [21:41:12] specifically, when available replicas hits zero [21:42:39] !log Deployed security fix for T394396 [21:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:13] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [21:43:55] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [21:44:09] That should be it for the security deployment window today [21:44:27] RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:45:27] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm 
[21:45:34] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10837402 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2005.codfw.wmnet with OS bookworm executed with errors: - sretest2005 (... [21:45:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:46:13] (CR) Ladsgroup: [C:+1] ores-extension: enable ores extention for rrla without the UI [mediawiki-config] - https://gerrit.wikimedia.org/r/1144526 (https://phabricator.wikimedia.org/T382171) (owner: Ilias Sarantopoulos) [21:46:36] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:47:01] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10837411 (Jhancock.wm) @Papaul this one has a partman file but i think it might have been for an older version of the sretest2005 server we had before. Can we get an update to the pres... 
[21:47:16] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10837413 (10Jhancock.wm) [21:47:42] (03PS3) 10Fabfur: hiera: disable vk (webrequest) on A:cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1147783 (https://phabricator.wikimedia.org/T393772) [21:47:53] (03CR) 10Fabfur: hiera: disable vk (webrequest) on A:cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1147783 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [21:48:29] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:48:31] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for santhosh - https://phabricator.wikimedia.org/T394740 (10thcipriani) 03NEW [21:48:45] RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:48:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye [21:49:00] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10837426 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host thanos-be2006.codfw.wmnet with OS bu... 
[21:51:09] (03CR) 10Ryan Kemper: [C:03+2] relforge: remove config prior to decommission [puppet] - 10https://gerrit.wikimedia.org/r/1140717 (https://phabricator.wikimedia.org/T390565) (owner: 10Bking) [21:52:24] (03PS1) 10Scott French: shellbox-video: increase replica count buffer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147890 [21:54:24] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1072.eqiad.wmnet with reason: host reimage [21:56:15] (03CR) 10JHathaway: [C:03+1] shellbox-video: increase replica count buffer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147890 (owner: 10Scott French) [21:56:26] (03PS1) 10Ryan Kemper: decom relforge100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/1147891 (https://phabricator.wikimedia.org/T390565) [21:56:27] (03PS1) 10Ryan Kemper: relforge: simplify/consolidate site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1147892 [21:57:08] (03CR) 10Scott French: [C:03+2] shellbox-video: increase replica count buffer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147890 (owner: 10Scott French) [21:57:23] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1147892 (owner: 10Ryan Kemper) [21:57:29] (03CR) 10Bking: [C:03+1] relforge: simplify/consolidate site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1147892 (owner: 10Ryan Kemper) [21:58:03] FIRING: KubernetesAPILatency: High Kubernetes API latency (PUT leases) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:58:03] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1072.eqiad.wmnet with reason: host reimage [21:58:06] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) 
rolling upgrade of Varnish on A:cp-upload_eqiad - > [21:58:36] (03Merged) 10jenkins-bot: shellbox-video: increase replica count buffer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147890 (owner: 10Scott French) [21:59:18] (03CR) 10Ryan Kemper: [C:03+2] decom relforge100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/1147891 (https://phabricator.wikimedia.org/T390565) (owner: 10Ryan Kemper) [21:59:20] (03CR) 10Ryan Kemper: [C:03+2] relforge: simplify/consolidate site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1147892 (owner: 10Ryan Kemper) [22:00:02] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [22:00:23] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1080.eqiad.wmnet with reason: host reimage [22:01:11] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [22:01:15] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts relforge[1003-1004].eqiad.wmnet [22:02:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db1209 and db2195 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76316 and previous config saved to /var/cache/conftool/dbconfig/20250519-220201-ladsgroup.json [22:02:05] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [22:02:42] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_eqiad - > [22:03:03] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PUT leases) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:03:08] (03CR) 10Bking: [C:03+1] Add dse-k8s-worker10[10-11] to the cluster [puppet] - 
10https://gerrit.wikimedia.org/r/1147887 (https://phabricator.wikimedia.org/T394647) (owner: 10Btullis) [22:03:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1080.eqiad.wmnet with reason: host reimage [22:04:22] ryankemper@cumin2002 decommission (PID 2331732) is awaiting input [22:04:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db1255 and db2241 from s8 (T351820)', diff saved to https://phabricator.wikimedia.org/P76317 and previous config saved to /var/cache/conftool/dbconfig/20250519-220432-ladsgroup.json [22:07:11] (03CR) 10Jdlrobson: [C:03+1] Make BundleSizeTest cross-compatible with <=1.44 and >=1.45 [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1147839 (https://phabricator.wikimedia.org/T394542) (owner: 10Jgleeson) [22:08:05] (03PS1) 10Ryan Kemper: cirrus streaming updater: decom relforge100[3,4] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147893 (https://phabricator.wikimedia.org/T390565) [22:08:13] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts relforge[1003-1004].eqiad.wmnet [22:08:24] (03PS3) 10Jdlrobson: Merge branch 'master' into wmf_deploy [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1147043 (owner: 10Greg Grossmeier) [22:09:39] (03PS2) 10Andrew Bogott: Prepare cloudvirt103[1-9] for decom [puppet] - 10https://gerrit.wikimedia.org/r/1147880 (https://phabricator.wikimedia.org/T394727) [22:09:39] (03PS2) 10Andrew Bogott: Remove mention of cloudvirt103[1-9].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1147883 (https://phabricator.wikimedia.org/T394727) [22:09:39] (03PS1) 10Andrew Bogott: nova-compute.conf: get even more aggressive about migrating VMs. 
[puppet] - 10https://gerrit.wikimedia.org/r/1147895 [22:10:17] (03PS1) 10Ryan Kemper: relforge: move to-be-decom'd into insetup [puppet] - 10https://gerrit.wikimedia.org/r/1147896 (https://phabricator.wikimedia.org/T390565) [22:10:49] (03CR) 10Bking: [C:03+1] relforge: move to-be-decom'd into insetup [puppet] - 10https://gerrit.wikimedia.org/r/1147896 (https://phabricator.wikimedia.org/T390565) (owner: 10Ryan Kemper) [22:10:49] (03CR) 10CI reject: [V:04-1] Prepare cloudvirt103[1-9] for decom [puppet] - 10https://gerrit.wikimedia.org/r/1147880 (https://phabricator.wikimedia.org/T394727) (owner: 10Andrew Bogott) [22:11:05] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1081.eqiad.wmnet with OS bullseye [22:11:09] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1081 [22:11:09] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1081 [22:11:48] (03CR) 10Ryan Kemper: "We'll revert this after I34fcdfa774b949fa2c450f28ef9b2bff9e7c0e59 is merged & we can run the decom cookbook" [puppet] - 10https://gerrit.wikimedia.org/r/1147896 (https://phabricator.wikimedia.org/T390565) (owner: 10Ryan Kemper) [22:11:57] (03CR) 10Ryan Kemper: [C:03+2] relforge: move to-be-decom'd into insetup [puppet] - 10https://gerrit.wikimedia.org/r/1147896 (https://phabricator.wikimedia.org/T390565) (owner: 10Ryan Kemper) [22:12:39] (03CR) 10Bking: [V:03+1] cirrus streaming updater: decom relforge100[3,4] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147893 (https://phabricator.wikimedia.org/T390565) (owner: 10Ryan Kemper) [22:12:45] (03CR) 10Bking: [C:03+1] cirrus streaming updater: decom relforge100[3,4] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147893 (https://phabricator.wikimedia.org/T390565) (owner: 10Ryan Kemper) [22:12:54] (03PS2) 10Ryan Kemper: cirrus streaming updater: decom relforge100[3,4] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147893 
(https://phabricator.wikimedia.org/T390565) [22:12:58] (03CR) 10Andrew Bogott: [C:03+2] nova-compute.conf: get even more aggressive about migrating VMs. [puppet] - 10https://gerrit.wikimedia.org/r/1147895 (owner: 10Andrew Bogott) [22:15:54] PROBLEM - nova-compute proc minimum on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:16:02] PROBLEM - nova-compute proc minimum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:16:10] PROBLEM - nova-compute proc minimum on cloudvirt1068 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:16:54] RECOVERY - nova-compute proc minimum on cloudvirt1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:19:46] PROBLEM - nova-compute proc maximum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:21:06] PROBLEM - OpenSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:21:08] PROBLEM - OpenSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:21:10] RECOVERY - nova-compute proc minimum on cloudvirt1068 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:21:10] PROBLEM - OpenSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:21:40] PROBLEM - OpenSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:22:02] PROBLEM - OpenSearch health check for shards on 9200 on relforge1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 42 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 51, active_shards: 65, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pendin [22:22:02] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 60.747663551401864 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:22:10] PROBLEM - nova-compute proc minimum on cloudvirt1068 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:22:16] PROBLEM - OpenSearch health check for shards on 9200 on relforge1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 42 threshold =0.15 breach: cluster_name: 
relforge-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 51, active_shards: 65, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pendin [22:22:16] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 60.747663551401864 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:22:16] PROBLEM - OpenSearch health check for shards on 9200 on relforge1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 42 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 51, active_shards: 65, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pendin [22:22:16] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 60.747663551401864 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:23:54] PROBLEM - nova-compute proc maximum on cloudvirt1068 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:25:21] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1081.eqiad.wmnet with reason: host reimage [22:25:42] ACKNOWLEDGEMENT - OpenSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) Brian_King decom T390565 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:42] ACKNOWLEDGEMENT - OpenSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) Brian_King decom T390565 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:42] ACKNOWLEDGEMENT - OpenSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) Brian_King decom T390565 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:42] ACKNOWLEDGEMENT - OpenSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. 
(read timeout=4) Brian_King decom T390565 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:42] ACKNOWLEDGEMENT - OpenSearch health check for shards on 9200 on relforge1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 42 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 51, active_shards: 65, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 36, delayed_unassigned_shards: 0, number_o [22:25:42] g_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 60.747663551401864 Brian_King decom T390565 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:43] ACKNOWLEDGEMENT - OpenSearch health check for shards on 9200 on relforge1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 42 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 51, active_shards: 65, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 36, delayed_unassigned_shards: 0, number_o [22:25:43] g_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 60.747663551401864 Brian_King decom T390565 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:44] ACKNOWLEDGEMENT - OpenSearch health check for shards on 9200 on relforge1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 42 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 51, active_shards: 65, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 36, delayed_unassigned_shards: 0, number_o [22:25:44] g_tasks: 0, number_of_in_flight_fetch: 0, 
task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 60.747663551401864 Brian_King decom T390565 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:46] RECOVERY - nova-compute proc maximum on cloudvirt1069 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:26:33] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on relforge[1003-1004,1008-1010].eqiad.wmnet with reason: decom in progress [22:28:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1080.eqiad.wmnet with OS bullseye [22:29:04] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1081.eqiad.wmnet with reason: host reimage [22:30:11] (03CR) 10Jdlrobson: [C:03+1] Merge branch 'master' into wmf_deploy [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1147043 (owner: 10Greg Grossmeier) [22:30:17] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1072.eqiad.wmnet with OS bookworm [22:30:45] PROBLEM - nova-compute proc maximum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:30:53] PROBLEM - nova-compute proc minimum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:31:27] PROBLEM - nova-compute proc maximum on cloudvirt1070 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:31:33] PROBLEM - nova-compute proc 
maximum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:31:45] PROBLEM - nova-compute proc minimum on cloudvirt1070 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:34:33] RECOVERY - nova-compute proc maximum on cloudvirt1071 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:34:45] RECOVERY - nova-compute proc minimum on cloudvirt1070 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:34:45] RECOVERY - nova-compute proc maximum on cloudvirt1069 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:34:53] RECOVERY - nova-compute proc maximum on cloudvirt1068 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:35:01] RECOVERY - nova-compute proc minimum on cloudvirt1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:35:45] PROBLEM - nova-compute proc minimum on cloudvirt1070 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:36:01] PROBLEM - nova-compute proc minimum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:39:12] (03PS3) 10Andrew Bogott: Prepare cloudvirt103[1-9] for decom [puppet] - 10https://gerrit.wikimedia.org/r/1147880 (https://phabricator.wikimedia.org/T394727) [22:39:12] (03PS3) 10Andrew Bogott: Remove mention of cloudvirt103[1-9].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1147883 (https://phabricator.wikimedia.org/T394727) [22:39:12] (03PS1) 10Andrew Bogott: Update cloudvirt IDs for 4 cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/1147898 (https://phabricator.wikimedia.org/T394671) [22:39:33] PROBLEM - nova-compute proc maximum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:39:45] PROBLEM - nova-compute proc maximum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:39:53] PROBLEM - nova-compute proc maximum on cloudvirt1068 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:40:21] (03CR) 10CI reject: [V:04-1] Prepare cloudvirt103[1-9] for decom [puppet] - 10https://gerrit.wikimedia.org/r/1147880 (https://phabricator.wikimedia.org/T394727) (owner: 10Andrew Bogott) [22:40:51] (03CR) 10Andrew Bogott: [C:03+2] Update cloudvirt IDs for 4 cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/1147898 (https://phabricator.wikimedia.org/T394671) (owner: 10Andrew Bogott) [22:42:53] RECOVERY - nova-compute proc maximum on cloudvirt1068 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting 
[22:43:09] RECOVERY - nova-compute proc minimum on cloudvirt1068 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:43:29] FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:44:33] RECOVERY - nova-compute proc maximum on cloudvirt1071 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:44:53] RECOVERY - nova-compute proc minimum on cloudvirt1071 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:45:27] RECOVERY - nova-compute proc maximum on cloudvirt1070 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:45:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye [22:45:41] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10837570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host thanos-be2007.codfw.wmnet with OS bu... 
[22:45:45] RECOVERY - nova-compute proc minimum on cloudvirt1070 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:46:45] RECOVERY - nova-compute proc maximum on cloudvirt1069 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:47:01] RECOVERY - nova-compute proc minimum on cloudvirt1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250519T2300) [23:01:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1081.eqiad.wmnet with OS bullseye [23:14:46] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [23:16:44] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [23:28:29] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [23:28:29] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:38:18] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1147900 [23:38:18] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
https://gerrit.wikimedia.org/r/1147900 (owner: TrainBranchBot) [23:40:46] PROBLEM - LDAP -writable server- on seaborgium is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [23:42:46] RECOVERY - LDAP -writable server- on seaborgium is OK: LDAP OK - 0.010 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [23:50:02] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1147900 (owner: TrainBranchBot) [23:55:36] (PS1) RLazarus: deployment_server: Use cli-image for mw-script [puppet] - https://gerrit.wikimedia.org/r/1147901 (https://phabricator.wikimedia.org/T378479) [23:56:14] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 83893MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops