[00:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:18] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1143976 [00:08:18] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1143976 (owner: 10TrainBranchBot) [00:11:09] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 655.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:17:23] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:17:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:18:17] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 2.834 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:18:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.496 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:25:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [00:27:34] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1143976 (owner: 10TrainBranchBot) [00:32:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on 
mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [00:42:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [00:46:56] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:24:25] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:24:29] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:28:42] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:09] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:29:15] PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch 
http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [02:30:05] RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 760, active_shards: 1787, relocating_shards: 6, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [02:30:05] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [02:37:57] PROBLEM - Host cloudnet2005-dev is DOWN: PING CRITICAL - Packet loss = 100% [02:39:13] RECOVERY - Host cloudnet2005-dev is UP: PING OK - Packet loss = 0%, RTA = 32.29 ms [02:45:07] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [02:48:05] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [02:59:23] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:59:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:00:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.503 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:00:27] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53942 bytes in 1.041 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:14:07] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [03:15:05] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [03:22:25] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:22:29] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:23:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:49:26] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:54:42] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:55:17] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:59:26] RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has 
failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:00:17] RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:25:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [04:32:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [04:42:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent 
[04:44:13] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [04:45:05] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [04:46:56] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:09:42] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:14:26] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10810205 (10Marostegui) @FCeratto-WMF this host needs to be recloned and added to production, can you take care of this? 
[05:26:43] PROBLEM - Restbase root url on restbase1041 is CRITICAL: connect to address 10.64.48.40 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [05:43:38] (03PS1) 10Marostegui: installserver: Do not format db1246 [puppet] - 10https://gerrit.wikimedia.org/r/1143985 [05:46:34] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1143985 (owner: 10Marostegui) [05:47:24] (03CR) 10Marostegui: [C:03+2] installserver: Do not format db1246 [puppet] - 10https://gerrit.wikimedia.org/r/1143985 (owner: 10Marostegui) [05:49:35] (03PS1) 10Marostegui: db1246: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1143988 (https://phabricator.wikimedia.org/T393296) [05:51:15] (03CR) 10Marostegui: [C:03+2] db1246: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1143988 (https://phabricator.wikimedia.org/T393296) (owner: 10Marostegui) [05:59:29] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: OpenSent - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:59:42] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:06:19] (03PS1) 10Marostegui: db1247: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1143992 (https://phabricator.wikimedia.org/T393612) [06:09:32] (03CR) 10Marostegui: [C:03+2] db1247: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1143992 (https://phabricator.wikimedia.org/T393612) (owner: 10Marostegui) [06:25:58] (03CR) 10Filippo Giunchedi: [C:03+2] Remove fran1001.frack.eqiad.wmnet from nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/1143889 (https://phabricator.wikimedia.org/T392818) (owner: 10Jgreen) [06:49:32] 06SRE, 
06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Do we need prometheus-ethtool-exporter? - https://phabricator.wikimedia.org/T371375#10810282 (10fgiunchedi) >>! In T371375#10808026, @cmooney wrote: >>>! In T371375#10807881, @cmooney wrote: >> Let me double check and report back. > > So i... [06:55:05] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10810288 (10Volans) So the error for the debmonitor client is due by the fact that in `/etc/os-release` there is no line with `VERSION_ID` yet, so th... [07:00:04] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250512T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. [07:12:35] !log slyngshede@cumin1002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Debt out of all services on: 2402 hosts [07:20:19] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [07:20:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:28:57] (03CR) 10JMeybohm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [07:31:31] (03PS1) 10Majavah: base: notify_maintainers: Exclude tools-tofu service account [puppet] - 10https://gerrit.wikimedia.org/r/1144177 [07:31:31] (03PS1) 10Majavah: base: notify_maintainers: Identify target username in emails [puppet] - 10https://gerrit.wikimedia.org/r/1144178 [07:36:25] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1143868 (owner: 10CDanis) [07:40:15] (03PS1) 10Slyngshede: data.yaml: Offboarding cicalese [puppet] - 10https://gerrit.wikimedia.org/r/1144419 [07:45:16] (03CR) 10Volans: [C:03+2] test-cookbook: expand help message [puppet] - 10https://gerrit.wikimedia.org/r/1143485 (owner: 10Volans) [07:46:17] (03CR) 10Volans: [C:03+2] "No, just re-adding it to cumin2002 as it was having twice the same hiera config. I'll see with Willy which final frequency he wants." [puppet] - 10https://gerrit.wikimedia.org/r/1143486 (owner: 10Volans) [07:48:44] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1144419 (owner: 10Slyngshede) [07:50:26] (03CR) 10Slyngshede: [C:03+2] data.yaml: Offboarding cicalese [puppet] - 10https://gerrit.wikimedia.org/r/1144419 (owner: 10Slyngshede) [07:54:16] (03CR) 10Elukey: [C:03+2] "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [07:54:22] (03CR) 10Elukey: [C:03+2] modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [07:54:37] (03PS15) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) [07:57:26] !log slyngshede@cumin1002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Cicalese out of all services on: 2402 hosts [08:01:14] (03CR) 10Slyngshede: [C:03+2] Account block: update templates [software/bitu] - 10https://gerrit.wikimedia.org/r/1143820 (https://phabricator.wikimedia.org/T393779) (owner: 10Slyngshede) [08:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:03:48] (03Merged) 10jenkins-bot: Account block: update templates [software/bitu] - 10https://gerrit.wikimedia.org/r/1143820 (https://phabricator.wikimedia.org/T393779) (owner: 10Slyngshede) [08:06:40] (03CR) 10Volans: [C:03+2] netbox: add fetch_device_interfaces using GraphQL [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [08:06:51] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T393205#10810421 (10JMeybohm) 05Open→03Resolved Resync completed without further IO errors. [08:07:17] (03CR) 10JMeybohm: [C:03+2] Remove ci namespace from wikikube staging clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143802 (https://phabricator.wikimedia.org/T288629) (owner: 10JMeybohm) [08:08:03] (03CR) 10Volans: [C:03+2] "I'll deploy this one with the new homer release" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [08:09:52] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [08:10:49] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[08:11:33] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:12:34] (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1137019 (https://phabricator.wikimedia.org/T392105) (owner: 10FNegri) [08:13:19] (03Merged) 10jenkins-bot: Remove ci namespace from wikikube staging clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143802 (https://phabricator.wikimedia.org/T288629) (owner: 10JMeybohm) [08:17:08] (03CR) 10Federico Ceratto: [C:03+2] "I'm acking two remaining nits and implement them in future CRs without setting them as resolved as they show up in my gerrit dashboard eve" [cookbooks] - 10https://gerrit.wikimedia.org/r/1130107 (https://phabricator.wikimedia.org/T388384) (owner: 10Federico Ceratto) [08:18:24] (03Merged) 10jenkins-bot: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:19:13] (03Merged) 10jenkins-bot: netbox: add fetch_device_interfaces using GraphQL [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [08:19:27] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:19:51] (03CR) 10MVernon: [C:03+2] apus: bring new frontend apus-fe1003 into service [puppet] - 10https://gerrit.wikimedia.org/r/1143821 (https://phabricator.wikimedia.org/T389632) (owner: 10MVernon) [08:19:56] (03PS1) 10Volans: .wmfconfig: build also for Debian Bookworm [software/spicerack] - 10https://gerrit.wikimedia.org/r/1144449 [08:22:27] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:22:59] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, 
shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:24:42] RESOLVED: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:25:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [08:27:45] (03CR) 10Vgutierrez: [C:03+1] Switch status page to haproxy metrics [puppet] - 10https://gerrit.wikimedia.org/r/1143868 (owner: 10CDanis) [08:28:03] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.10.0 [software/homer] - 10https://gerrit.wikimedia.org/r/1144450 [08:28:14] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v0.10.0 [software/homer] - 10https://gerrit.wikimedia.org/r/1144450 (owner: 10Volans) [08:29:44] !log mvernon@cumin1002 conftool action : set/weight=40; selector: service=apus,name=apus-fe1003.eqiad.wmnet [08:29:55] !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: service=apus,name=apus-fe1003.eqiad.wmnet [08:30:47] 06SRE, 10SRE-swift-storage, 10Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10810475 (10MatthewVernon) [08:31:05] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[08:32:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:33:36] (03CR) 10MVernon: [C:03+2] thanos: remove thanos-fe200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1143824 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon) [08:33:45] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [08:34:13] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [08:34:39] (03PS1) 10Majavah: P:toolforge::k8s::worker: Add network tests access rules [puppet] - 10https://gerrit.wikimedia.org/r/1144452 (https://phabricator.wikimedia.org/T393775) [08:35:32] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [08:36:29] 06SRE, 10MW-on-K8s, 06serviceops, 10Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629#10810495 (10JMeybohm) 05Open→03Resolved a:03JMeybohm All related changes have been reverted [08:39:22] !log mvernon@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe [08:39:22] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=1) rolling restart_daemons on A:thanos-fe [08:39:35] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.10.0 [software/homer] - 10https://gerrit.wikimedia.org/r/1144450 (owner: 10Volans) [08:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:42:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:43:17] !log mvernon@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on P{thanos-fe200[4-7]*} or P{thanos-fe1*} and (A:thanos-fe or A:thanos-fe-codfw or A:thanos-fe-eqiad) [08:46:56] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:47:16] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=1) rolling restart_daemons on P{thanos-fe200[4-7]*} or P{thanos-fe1*} and (A:thanos-fe or A:thanos-fe-codfw or A:thanos-fe-eqiad) [08:48:35] !log mvernon@cumin1002 START - Cookbook sre.hosts.decommission for hosts thanos-be[2001-2003].codfw.wmnet [08:49:24] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts thanos-be[2001-2003].codfw.wmnet [08:49:36] (03PS1) 10Elukey: airflow: move to mesh:configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144454 (https://phabricator.wikimedia.org/T391333) [08:49:36] (03PS1) 10Elukey: apertium: upgrade to mesh:configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144455 (https://phabricator.wikimedia.org/T391333) [08:49:38] (03PS1) 10Elukey: api-gateway: upgrade to mesh:configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144456 (https://phabricator.wikimedia.org/T391333) [08:49:39] (03PS1) 10Elukey: aqs-http-gateway: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144457 
(https://phabricator.wikimedia.org/T391333) [08:49:41] (03PS1) 10Elukey: blunderbuss: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144458 (https://phabricator.wikimedia.org/T391333) [08:49:42] (03CR) 10FNegri: [C:03+1] base: notify_maintainers: Exclude tools-tofu service account [puppet] - 10https://gerrit.wikimedia.org/r/1144177 (owner: 10Majavah) [08:49:42] (03PS1) 10Elukey: calculator-service: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144459 (https://phabricator.wikimedia.org/T391333) [08:49:46] (03PS1) 10Elukey: changeprop: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144460 (https://phabricator.wikimedia.org/T391333) [08:49:51] (03PS1) 10Elukey: chart-renderer: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144461 (https://phabricator.wikimedia.org/T391333) [08:49:55] (03PS1) 10Elukey: chromium-renderer: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144462 (https://phabricator.wikimedia.org/T391333) [08:49:59] (03PS1) 10Elukey: citoid: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144463 (https://phabricator.wikimedia.org/T391333) [08:50:03] (03PS1) 10Elukey: cxserver: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144464 (https://phabricator.wikimedia.org/T391333) [08:50:07] (03PS1) 10Elukey: datahub: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144465 (https://phabricator.wikimedia.org/T391333) [08:50:13] !log mvernon@cumin1002 START - Cookbook sre.hosts.decommission for hosts thanos-fe[2001-2003].codfw.wmnet [08:50:15] (03PS1) 10Elukey: datasets-config: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144466 (https://phabricator.wikimedia.org/T391333) 
[08:50:19] (03PS1) 10Elukey: developer-portal: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144467 (https://phabricator.wikimedia.org/T391333) [08:50:23] (03PS1) 10Elukey: echoserver: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144468 (https://phabricator.wikimedia.org/T391333) [08:50:27] (03PS1) 10Elukey: eventgate: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144469 (https://phabricator.wikimedia.org/T391333) [08:50:31] (03PS1) 10Elukey: eventstreams: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144470 (https://phabricator.wikimedia.org/T391333) [08:50:35] (03PS1) 10Elukey: flink-app: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144471 (https://phabricator.wikimedia.org/T391333) [08:50:39] (03PS1) 10Elukey: function-evaluator: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144472 (https://phabricator.wikimedia.org/T391333) [08:50:43] (03PS1) 10Elukey: function-orchestrator: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144473 (https://phabricator.wikimedia.org/T391333) [08:50:47] (03PS1) 10DCausse: wdqs: check max lag on wdqs-main and wdqs-sholarly [alerts] - 10https://gerrit.wikimedia.org/r/1144474 [08:51:22] (03CR) 10Ayounsi: [C:03+1] "Tested it locally and works as expected!" 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney) [08:51:40] (03CR) 10CI reject: [V:04-1] blunderbuss: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144458 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:51:53] (03CR) 10CI reject: [V:04-1] calculator-service: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144459 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:52:05] (03CR) 10CI reject: [V:04-1] changeprop: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144460 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:52:20] (03CR) 10CI reject: [V:04-1] chart-renderer: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144461 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:52:25] oh noes [08:52:32] (03CR) 10CI reject: [V:04-1] chromium-renderer: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144462 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:52:52] (03CR) 10CI reject: [V:04-1] cxserver: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144464 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:52:53] (03CR) 10FNegri: base: notify_maintainers: Identify target username in emails (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1144178 (owner: 10Majavah) [08:53:04] (03CR) 10CI reject: [V:04-1] citoid: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144463 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:53:17] some spam incoming, I'll double check why it failed, sigh [08:53:34] (03CR) 10CI reject: [V:04-1] datahub: upgrade to mesh.configuration 1.13 [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/1144465 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:54:15] (03CR) 10CI reject: [V:04-1] datasets-config: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144466 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:54:38] (03CR) 10CI reject: [V:04-1] developer-portal: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144467 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:54:55] (03CR) 10CI reject: [V:04-1] echoserver: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144468 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:55:00] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:55:04] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission thanos-fe200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T393870 (10MatthewVernon) 03NEW [08:55:04] (03PS1) 10Brouberol: postgresql-airflow: fix CI by moving fixture files under the PH helmfile dirs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144477 [08:55:34] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:55:35] (03CR) 10Elukey: [C:03+1] .wmfconfig: build also for Debian Bookworm [software/spicerack] - 10https://gerrit.wikimedia.org/r/1144449 (owner: 10Volans) [08:55:37] (03CR) 10CI reject: [V:04-1] eventgate: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144469 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:55:45] (03PS2) 10Brouberol: postgresql-airflow: fix CI by moving fixture files under the PG helmfile dirs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144477 [08:55:47] (03PS2) 10Majavah: base: notify_maintainers: Exclude tools-tofu service account 
[puppet] - 10https://gerrit.wikimedia.org/r/1144177 [08:55:47] (03PS2) 10Majavah: base: notify_maintainers: Identify target username in emails [puppet] - 10https://gerrit.wikimedia.org/r/1144178 [08:55:59] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10810574 (10MatthewVernon) [08:56:04] (03CR) 10Majavah: base: notify_maintainers: Identify target username in emails (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1144178 (owner: 10Majavah) [08:56:37] (03PS2) 10DCausse: wdqs: check max lag on wdqs-main and wdqs-sholarly [alerts] - 10https://gerrit.wikimedia.org/r/1144474 [08:56:51] (03CR) 10CI reject: [V:04-1] eventstreams: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144470 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:57:10] (03CR) 10CI reject: [V:04-1] flink-app: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144471 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:58:07] (03CR) 10CI reject: [V:04-1] function-evaluator: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144472 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:58:46] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1144452 (https://phabricator.wikimedia.org/T393775) (owner: 10Majavah) [08:59:05] (03CR) 10CI reject: [V:04-1] function-orchestrator: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144473 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [09:00:01] !log mvernon@cumin1002 START - Cookbook sre.dns.netbox [09:01:51] (03CR) 10Volans: [C:03+2] .wmfconfig: build also for Debian Bookworm [software/spicerack] - 10https://gerrit.wikimedia.org/r/1144449 (owner: 10Volans) [09:02:30] (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::worker: Add 
network tests access rules [puppet] - 10https://gerrit.wikimedia.org/r/1144452 (https://phabricator.wikimedia.org/T393775) (owner: 10Majavah) [09:02:37] (03CR) 10Volans: [C:03+2] WMF-Plugin: Potential clean-up of b-end circuit finding logic [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney) [09:03:56] !log mvernon@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: thanos-fe[2001-2003].codfw.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin1002" [09:04:36] !log mvernon@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: thanos-fe[2001-2003].codfw.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin1002" [09:04:36] !log mvernon@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:04:37] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts thanos-fe[2001-2003].codfw.wmnet [09:04:49] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10810595 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin1002 for hosts: `thanos-fe[2001-2003].codfw.wmnet` - thanos-fe2001.codfw.wmnet (**PASS**) - Downti... 
[09:11:13] (03Merged) 10jenkins-bot: .wmfconfig: build also for Debian Bookworm [software/spicerack] - 10https://gerrit.wikimedia.org/r/1144449 (owner: 10Volans) [09:14:07] 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 1VM request for zuul3+ - https://phabricator.wikimedia.org/T393873 (10Jelto) 03NEW [09:14:20] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 06Release-Engineering-Team, 10vm-requests: codfw: 1VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10810632 (10Jelto) [09:14:54] (03CR) 10JMeybohm: [C:03+1] postgresql-airflow: fix CI by moving fixture files under the PG helmfile dirs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144477 (owner: 10Brouberol) [09:15:05] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 06Release-Engineering-Team, 10vm-requests: codfw: 1VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10810634 (10Jelto) [09:15:43] (03CR) 10Brouberol: [C:03+2] postgresql-airflow: fix CI by moving fixture files under the PG helmfile dirs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144477 (owner: 10Brouberol) [09:16:58] (03PS1) 10STran: htmlform: Fix rendering contents for cloner fields [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144481 (https://phabricator.wikimedia.org/T393790) [09:17:45] (03Merged) 10jenkins-bot: postgresql-airflow: fix CI by moving fixture files under the PG helmfile dirs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144477 (owner: 10Brouberol) [09:20:06] (03CR) 10Majavah: [C:03+2] base: notify_maintainers: Exclude tools-tofu service account [puppet] - 10https://gerrit.wikimedia.org/r/1144177 (owner: 10Majavah) [09:20:32] (03CR) 10FNegri: [C:03+1] base: notify_maintainers: Identify target username in emails [puppet] - 10https://gerrit.wikimedia.org/r/1144178 (owner: 10Majavah) [09:20:50] (03CR) 10Majavah: [C:03+2] base: notify_maintainers: Identify target username in emails [puppet] - 
10https://gerrit.wikimedia.org/r/1144178 (owner: 10Majavah) [09:24:22] (03CR) 10Slyngshede: Initial implementation of VueJS frontend (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1139776 (https://phabricator.wikimedia.org/T391443) (owner: 10Slyngshede) [09:24:33] (03CR) 10Slyngshede: [C:03+2] Initial implementation of VueJS frontend [software/bitu] - 10https://gerrit.wikimedia.org/r/1139776 (https://phabricator.wikimedia.org/T391443) (owner: 10Slyngshede) [09:25:09] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-fe1007.eqiad.wmnet with OS bullseye [09:25:21] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10810651 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host thanos-fe1007.eqiad.wmnet with OS bullseye [09:27:25] (03Merged) 10jenkins-bot: Initial implementation of VueJS frontend [software/bitu] - 10https://gerrit.wikimedia.org/r/1139776 (https://phabricator.wikimedia.org/T391443) (owner: 10Slyngshede) [09:28:31] (03CR) 10FNegri: [C:03+2] wikireplicas: maintain-views should not create _p db [puppet] - 10https://gerrit.wikimedia.org/r/1137019 (https://phabricator.wikimedia.org/T392105) (owner: 10FNegri) [09:30:54] (03PS2) 10Elukey: blunderbuss: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144458 (https://phabricator.wikimedia.org/T391333) [09:30:54] (03PS2) 10Elukey: calculator-service: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144459 (https://phabricator.wikimedia.org/T391333) [09:30:55] (03PS2) 10Elukey: changeprop: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144460 (https://phabricator.wikimedia.org/T391333) [09:30:55] (03PS2) 10Elukey: chart-renderer: upgrade to mesh.configuration 1.13 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1144461 (https://phabricator.wikimedia.org/T391333) [09:30:56] (03PS2) 10Elukey: chromium-renderer: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144462 (https://phabricator.wikimedia.org/T391333) [09:30:59] (03PS2) 10Elukey: citoid: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144463 (https://phabricator.wikimedia.org/T391333) [09:31:03] (03PS2) 10Elukey: cxserver: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144464 (https://phabricator.wikimedia.org/T391333) [09:31:07] (03PS2) 10Elukey: datahub: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144465 (https://phabricator.wikimedia.org/T391333) [09:31:11] (03PS2) 10Elukey: datasets-config: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144466 (https://phabricator.wikimedia.org/T391333) [09:31:19] (03PS2) 10Elukey: developer-portal: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144467 (https://phabricator.wikimedia.org/T391333) [09:31:23] (03PS2) 10Elukey: echoserver: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144468 (https://phabricator.wikimedia.org/T391333) [09:31:27] (03PS2) 10Elukey: eventgate: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144469 (https://phabricator.wikimedia.org/T391333) [09:31:31] (03PS2) 10Elukey: eventstreams: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144470 (https://phabricator.wikimedia.org/T391333) [09:31:35] (03PS2) 10Elukey: flink-app: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144471 (https://phabricator.wikimedia.org/T391333) [09:31:43] (03PS2) 10Elukey: function-evaluator: upgrade to 
mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144472 (https://phabricator.wikimedia.org/T391333) [09:31:47] (03PS2) 10Elukey: function-orchestrator: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144473 (https://phabricator.wikimedia.org/T391333) [09:34:17] (03PS1) 10SD0001: Replace deprecated wgCirrusSearchWMFExtraFeatures with wgCirrusSearchWeightedTags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144484 (https://phabricator.wikimedia.org/T393872) [09:36:29] (03CR) 10Máté Szabó: [C:03+1] htmlform: Fix rendering contents for cloner fields [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144481 (https://phabricator.wikimedia.org/T393790) (owner: 10STran) [09:36:36] !log mvernon@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host thanos-fe1007.eqiad.wmnet with OS bullseye [09:36:41] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10810694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host thanos-fe1007.eqiad.wmnet with OS bullseye... [09:37:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144481 (https://phabricator.wikimedia.org/T393790) (owner: 10STran) [09:37:33] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10810699 (10MatthewVernon) thanos-fe1007 looks like it's not even trying to PXE at the moment, so maybe the 10g card needs setting up to PXE on this s... 
[09:38:19] (03CR) 10CI reject: [V:04-1] datasets-config: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144466 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [09:46:22] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10810721 (10cmooney) [09:48:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141852 (https://phabricator.wikimedia.org/T388685) (owner: 10Neslihan Turan) [09:50:55] (03CR) 10JMeybohm: [C:03+1] helm: remove duplicate alternatives::select entry [puppet] - 10https://gerrit.wikimedia.org/r/1140164 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [09:56:13] (03PS1) 10Gergő Tisza: Do not do unnecessary fallback during autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144488 (https://phabricator.wikimedia.org/T393621) [09:56:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [extensions/CentralAuth] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144488 (https://phabricator.wikimedia.org/T393621) (owner: 10Gergő Tisza) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250512T1000) [10:01:04] (03PS1) 10Ayounsi: LibreNMS report: small fixes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144491 [10:03:22] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144466 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [10:03:38] (03Abandoned) 10Ayounsi: LibreNMS report: various fixes [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/1091212 (https://phabricator.wikimedia.org/T379907) (owner: 10Ayounsi) [10:07:06] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/1142517 (owner: 10Majavah) [10:08:04] !log Ran fixStuckGlobalRename.php for T393877 [10:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:07] T393877: Unblock stuck global rename of Singasarská - https://phabricator.wikimedia.org/T393877 [10:16:07] (03CR) 10Ladsgroup: [C:03+1] mw::maintenance: migrate db_lag_stats_reporter to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143533 (https://phabricator.wikimedia.org/T385800) (owner: 10Hnowlan) [10:16:21] (03CR) 10Ladsgroup: [C:03+1] "Let's monitor and see" [puppet] - 10https://gerrit.wikimedia.org/r/1143533 (https://phabricator.wikimedia.org/T385800) (owner: 10Hnowlan) [10:17:59] (03CR) 10Volans: [C:03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144491 (owner: 10Ayounsi) [10:18:24] (03CR) 10Ayounsi: [C:03+2] LibreNMS report: small fixes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144491 (owner: 10Ayounsi) [10:20:33] (03Merged) 10jenkins-bot: LibreNMS report: small fixes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144491 (owner: 10Ayounsi) [10:21:05] (03PS2) 10Lucas Werkmeister (WMDE): manage-dblist: Rename to manage-dblist.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139489 (https://phabricator.wikimedia.org/T392819) [10:21:47] (03PS1) 10Volans: Release v0.10.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1144493 [10:22:21] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [10:22:51] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [10:25:25] (03PS1) 10Gergő Tisza: Do not use $_SESSION [extensions/LiquidThreads] (wmf/1.44.0-wmf.28) - 
10https://gerrit.wikimedia.org/r/1144495 (https://phabricator.wikimedia.org/T29887) [10:25:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/LiquidThreads] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144495 (https://phabricator.wikimedia.org/T29887) (owner: 10Gergő Tisza) [10:28:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143962 (https://phabricator.wikimedia.org/T124371) (owner: 10Gergő Tisza) [10:28:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140725 (https://phabricator.wikimedia.org/T362324) (owner: 10Máté Szabó) [10:30:55] (03CR) 10Gergő Tisza: "On second thought on don't think we need to be this cautious here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140725 (https://phabricator.wikimedia.org/T362324) (owner: 10Máté Szabó) [10:31:05] (03PS4) 10Gergő Tisza: Set wgPHPSessionHandling to 'warn' on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140725 (https://phabricator.wikimedia.org/T362324) (owner: 10Máté Szabó) [10:31:39] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1088.eqiad.wmnet with OS bullseye [10:32:22] !log delete some exterminated cables from Netbox - T393188 [10:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:25] T393188: Netbox: unterminated cables - https://phabricator.wikimedia.org/T393188 [10:33:47] (03CR) 10Máté Szabó: "nit: I think we should update the commit title now, as we'll be deploying this more widely." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140725 (https://phabricator.wikimedia.org/T362324) (owner: 10Máté Szabó) [10:33:55] (03PS5) 10Gergő Tisza: Set wgPHPSessionHandling to 'warn' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140725 (https://phabricator.wikimedia.org/T362324) (owner: 10Máté Szabó) [10:33:55] (03PS1) 10Gergő Tisza: Set $wgPHPSessionHandling to 'disable' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (https://phabricator.wikimedia.org/T362324) [10:34:14] (03CR) 10Gergő Tisza: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140725 (https://phabricator.wikimedia.org/T362324) (owner: 10Máté Szabó) [10:34:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (https://phabricator.wikimedia.org/T362324) (owner: 10Gergő Tisza) [10:34:55] (03CR) 10Gergő Tisza: "(Won't deploy it yet, only test.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (https://phabricator.wikimedia.org/T362324) (owner: 10Gergő Tisza) [10:35:31] (03PS10) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) [10:35:56] (03CR) 10Máté Szabó: [C:03+1] Set $wgPHPSessionHandling to 'disable' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (https://phabricator.wikimedia.org/T362324) (owner: 10Gergő Tisza) [10:38:02] (03PS1) 10Majavah: base: puppet_alert: Drop decimals from seconds in emails [puppet] - 10https://gerrit.wikimedia.org/r/1144498 [10:40:45] (03CR) 10CI reject: [V:04-1] base: puppet_alert: Drop decimals from seconds in emails [puppet] - 10https://gerrit.wikimedia.org/r/1144498 (owner: 10Majavah) [10:40:53] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 
0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [10:41:10] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:41:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T392806)', diff saved to https://phabricator.wikimedia.org/P75895 and previous config saved to /var/cache/conftool/dbconfig/20250512-104116-fceratto.json [10:42:32] (03PS2) 10Majavah: base: puppet_alert: Drop decimals from seconds in emails [puppet] - 10https://gerrit.wikimedia.org/r/1144498 [10:44:27] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1088.eqiad.wmnet with reason: host reimage [10:47:42] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1088.eqiad.wmnet with reason: host reimage [10:48:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T392806)', diff saved to https://phabricator.wikimedia.org/P75896 and previous config saved to /var/cache/conftool/dbconfig/20250512-104803-fceratto.json [10:51:44] (03PS1) 10Ayounsi: Cables report: alert on unterminated cables [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144499 (https://phabricator.wikimedia.org/T393188) [10:52:44] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "other than the question inlined, LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1137757 (https://phabricator.wikimedia.org/T366936) (owner: 10Majavah) [10:54:36] (03PS1) 10Hnowlan: mw::maintenance: migrate recountCategories job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144500 (https://phabricator.wikimedia.org/T388533) [10:55:39] (03CR) 10CI reject: [V:04-1] mw::maintenance: migrate recountCategories job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144500 (https://phabricator.wikimedia.org/T388533) (owner: 10Hnowlan) [10:56:13] (03PS2) 10Majavah: P:wmcs: cloudgw: Refuse outbound mail via NAT [puppet] - 10https://gerrit.wikimedia.org/r/1137757 (https://phabricator.wikimedia.org/T366936) [10:56:13] (03PS2) 10Majavah: P:exim::smarthost: Convert unsupported domain warn to reject [puppet] - 10https://gerrit.wikimedia.org/r/1137758 (https://phabricator.wikimedia.org/T366935) [10:56:42] (03PS2) 10Hnowlan: mw::maintenance: migrate recountCategories job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144500 (https://phabricator.wikimedia.org/T388533) [10:57:22] (03CR) 10Majavah: P:wmcs: cloudgw: Refuse outbound mail via NAT (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137757 (https://phabricator.wikimedia.org/T366936) (owner: 10Majavah) [10:57:53] (03CR) 10CI reject: [V:04-1] mw::maintenance: migrate recountCategories job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144500 (https://phabricator.wikimedia.org/T388533) (owner: 10Hnowlan) [10:58:49] (03CR) 10Ayounsi: "Tested on Netbox-next : https://netbox-next.wikimedia.org/extras/scripts/results/117681/" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144499 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [10:59:53] (03CR) 10Volans: [C:03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144499 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [11:00:10] (03CR) 10Ayounsi: [C:03+2] Cables report: alert on unterminated cables [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/1144499 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [11:00:24] (03PS3) 10Hnowlan: mw::maintenance: migrate recountCategories job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144500 (https://phabricator.wikimedia.org/T388533) [11:01:45] (03PS1) 10Ladsgroup: objectcache: Cast explicitly to integer [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144502 (https://phabricator.wikimedia.org/T393879) [11:02:04] (03Merged) 10jenkins-bot: Cables report: alert on unterminated cables [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144499 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [11:03:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P75897 and previous config saved to /var/cache/conftool/dbconfig/20250512-110310-fceratto.json [11:03:37] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [11:03:53] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1088.eqiad.wmnet with OS bullseye [11:04:03] (03PS7) 10JMeybohm: Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127087 (https://phabricator.wikimedia.org/T378429) [11:06:13] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [11:08:03] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [11:08:32] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [11:09:39] (03CR) 10Ayounsi: [C:03+1] "lgtm!" 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1144493 (owner: 10Volans) [11:09:55] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:09:58] (03CR) 10Volans: [C:03+2] Release v0.10.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1144493 (owner: 10Volans) [11:10:23] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:11:17] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.554 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:11:18] !log volans@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Release v0.10.0 - volans@cumin1003 [11:11:45] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:12:25] !log volans@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Release v0.10.0 - volans@cumin1003 [11:12:57] jouncebot: nowandnext [11:12:57] No deployments scheduled for the next 0 hour(s) and 47 minute(s) [11:12:57] In 0 hour(s) and 47 minute(s): Debugging for T392251 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250512T1200) [11:12:59] T392251: SessionBackend seems to store session changes too often - https://phabricator.wikimedia.org/T392251 [11:16:08] (03PS1) 10Hnowlan: mw::maintenance: migrate cleanupUploadStash job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144507 (https://phabricator.wikimedia.org/T385868) [11:16:34] !log volans@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: 
Revert to v0.9.0 - volans@cumin1003 [11:17:39] !log volans@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Revert to v0.9.0 - volans@cumin1003 [11:18:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P75898 and previous config saved to /var/cache/conftool/dbconfig/20250512-111817-fceratto.json [11:22:59] !log volans@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Revert to v0.9.0 - volans@cumin1003 [11:25:26] !log volans@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Revert to v0.9.0 - volans@cumin1003 [11:31:38] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1137757 (https://phabricator.wikimedia.org/T366936) (owner: 10Majavah) [11:32:54] (03CR) 10Majavah: [C:03+2] P:wmcs: cloudgw: Refuse outbound mail via NAT [puppet] - 10https://gerrit.wikimedia.org/r/1137757 (https://phabricator.wikimedia.org/T366936) (owner: 10Majavah) [11:33:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T392806)', diff saved to https://phabricator.wikimedia.org/P75899 and previous config saved to /var/cache/conftool/dbconfig/20250512-113324-fceratto.json [11:33:43] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:33:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T392806)', diff saved to https://phabricator.wikimedia.org/P75900 and previous config saved to /var/cache/conftool/dbconfig/20250512-113350-fceratto.json [11:40:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T392806)', diff saved to 
https://phabricator.wikimedia.org/P75901 and previous config saved to /var/cache/conftool/dbconfig/20250512-114038-fceratto.json [11:41:31] (03PS1) 10Brouberol: airflow-wmde: remove postgresql from helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144514 [11:42:02] (03CR) 10Majavah: [C:03+2] Revert "common: Temporarily remove some keys" [homer/public] - 10https://gerrit.wikimedia.org/r/1142517 (owner: 10Majavah) [11:42:05] 06SRE, 06cloud-services-team, 10Horizon, 06serviceops, 10Striker: Move cloudweb to Ganeti VMs and repurpose the servers as wikikube nodes - https://phabricator.wikimedia.org/T392478#10811004 (10aborrero) p:05Triage→03Low [11:42:35] (03Merged) 10jenkins-bot: Revert "common: Temporarily remove some keys" [homer/public] - 10https://gerrit.wikimedia.org/r/1142517 (owner: 10Majavah) [11:42:43] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144514 (owner: 10Brouberol) [11:43:19] (03CR) 10Brouberol: [C:03+2] airflow-wmde: remove postgresql from helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144514 (owner: 10Brouberol) [11:44:11] (03CR) 10FNegri: [C:03+1] base: puppet_alert: Drop decimals from seconds in emails [puppet] - 10https://gerrit.wikimedia.org/r/1144498 (owner: 10Majavah) [11:44:38] (03PS3) 10Majavah: base: puppet_alert: Drop decimals from seconds in emails [puppet] - 10https://gerrit.wikimedia.org/r/1144498 [11:47:42] (03CR) 10Majavah: [C:03+2] base: puppet_alert: Drop decimals from seconds in emails [puppet] - 10https://gerrit.wikimedia.org/r/1144498 (owner: 10Majavah) [11:49:19] (03PS8) 10JMeybohm: Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127087 (https://phabricator.wikimedia.org/T378429) [11:52:10] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Alert in need of triage: PuppetFailure (instance an-worker1068:9100) - https://phabricator.wikimedia.org/T392554#10811059 
(10BTullis) It looks like one of the disks has certainly failed. From here: https://wikitech.wikimedia.org/wiki/MegaCli#Di... [11:55:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P75902 and previous config saved to /var/cache/conftool/dbconfig/20250512-115545-fceratto.json [11:56:19] (03PS1) 10Hnowlan: mw::maintenance: migrate initSiteStats cron to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144517 (https://phabricator.wikimedia.org/T388534) [11:57:23] (03CR) 10CI reject: [V:04-1] mw::maintenance: migrate initSiteStats cron to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144517 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [11:58:43] (03PS2) 10Hnowlan: mw::maintenance: migrate initSiteStats cron to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144517 (https://phabricator.wikimedia.org/T388534) [11:59:48] (03CR) 10CI reject: [V:04-1] mw::maintenance: migrate initSiteStats cron to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144517 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [12:00:05] tgr: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Debugging for T392251 . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250512T1200). 
[12:00:06] T392251: SessionBackend seems to store session changes too often - https://phabricator.wikimedia.org/T392251 [12:00:45] (03CR) 10MSantos: rest-gateway: route reading lists API (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143127 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan) [12:00:53] (03PS3) 10Hnowlan: mw::maintenance: migrate initSiteStats cron to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144517 (https://phabricator.wikimedia.org/T388534) [12:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:07] (03PS2) 10Hnowlan: rest-gateway: route reading lists API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143127 (https://phabricator.wikimedia.org/T384891) [12:03:24] (03CR) 10Hnowlan: rest-gateway: route reading lists API (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143127 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan) [12:05:17] 10SRE-swift-storage, 06Commons, 10media-backups: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049#10811102 (10PMG) @jcrespo - I tried to reupload them in same place, but I got error message th... [12:08:15] (03PS1) 10Btullis: Bump nodemanager heap on the production Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1144520 (https://phabricator.wikimedia.org/T393695) [12:08:44] (03CR) 10Gmodena: [C:03+1] "LGTM. Thanks Luca!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1144471 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [12:08:59] (03PS1) 10Gkyziridis: ml-inference-services: edit-check experirmental prod deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) [12:09:54] (03CR) 10Kamila Součková: [C:03+2] mw-cron/updatequerypages: Migrate Mostcategories,Mostlinkedtemplates [puppet] - 10https://gerrit.wikimedia.org/r/1143803 (https://phabricator.wikimedia.org/T388534) (owner: 10Kamila Součková) [12:09:57] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM to my untrained eye!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144454 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [12:10:37] jouncebot: nowandnext [12:10:37] For the next 0 hour(s) and 49 minute(s): Debugging for T392251 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250512T1200) [12:10:37] In 0 hour(s) and 49 minute(s): UTC afternoon backport window / Backport Party!Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250512T1300) [12:10:38] T392251: SessionBackend seems to store session changes too often - https://phabricator.wikimedia.org/T392251 [12:10:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P75903 and previous config saved to /var/cache/conftool/dbconfig/20250512-121053-fceratto.json [12:11:01] is anyone actually doing the debugging? [12:11:08] yes [12:11:14] cool [12:11:23] are you doing something urgent? 
[12:11:35] nah, it's urgent but it can wait until midnight [12:11:54] so no worries [12:11:58] FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:11:59] ok, thx [12:12:07] !incidents [12:12:08] 6114 (UNACKED) [2x] ProbeDown sre (upload-https:443 probes/service eqsin) [12:12:13] !ack 6114 [12:12:14] 6114 (ACKED) [2x] ProbeDown sre (upload-https:443 probes/service eqsin) [12:12:16] looking [12:12:21] Woot [12:12:40] Checking too [12:13:28] I feel this is not how the spiderpig UI was intended to look: https://phabricator.wikimedia.org/F59911386 [12:13:30] FIRING: [2x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 6 unhealthy realservers pooled on lvs5006:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [12:13:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:14:08] marostegui: I'm checking the live traffic in eqsin [12:14:20] probably better to move to the private chan [12:14:26] ok, I can click on it to see the actual error. Still a bit weird. 
[12:14:33] volans: agreed [12:14:37] !incidents [12:14:38] 6114 (ACKED) [2x] ProbeDown sre (upload-https:443 probes/service eqsin) [12:14:38] 6115 (UNACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [12:14:43] !ack 6115 [12:14:44] 6115 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [12:14:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:14:57] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10811131 (10cmooney) [12:15:18] (03PS1) 10Ilias Sarantopoulos: ores-extension: enable rrrla model in idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144526 (https://phabricator.wikimedia.org/T382171) [12:15:42] FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:17:22] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5524/co" [puppet] - 10https://gerrit.wikimedia.org/r/1144520 (https://phabricator.wikimedia.org/T393695) (owner: 10Btullis) [12:18:18] (03CR) 10Ilias Sarantopoulos: "everything looks good -- just one suggestion to switch the image with the most recent one" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [12:18:21] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply [12:18:27] !log kamila@deploy1003 helmfile 
[codfw] DONE helmfile.d/services/mw-cron: apply [12:18:30] RESOLVED: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 6 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [12:18:34] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [12:18:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143962 (https://phabricator.wikimedia.org/T124371) (owner: 10Gergő Tisza) [12:18:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [extensions/LiquidThreads] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144495 (https://phabricator.wikimedia.org/T29887) (owner: 10Gergő Tisza) [12:18:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140725 (https://phabricator.wikimedia.org/T362324) (owner: 10Máté Szabó) [12:19:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:19:54] (03Merged) 10jenkins-bot: Get rid of ancient session_name call [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143962 (https://phabricator.wikimedia.org/T124371) (owner: 10Gergő Tisza) [12:20:16] (03Merged) 10jenkins-bot: Do not use $_SESSION [extensions/LiquidThreads] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144495 (https://phabricator.wikimedia.org/T29887) (owner: 10Gergő Tisza) [12:20:42] RESOLVED: JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:21:02] (03PS1) 10Gergő Tisza: Improve session logging [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144532 (https://phabricator.wikimedia.org/T393038) [12:21:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144532 (https://phabricator.wikimedia.org/T393038) (owner: 10Gergő Tisza) [12:21:44] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144533 [12:21:58] RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:23:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:24:44] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [12:25:13] (03PS2) 10SD0001: Replace deprecated wgCirrusSearchWMFExtraFeatures with wgCirrusSearchWeightedTags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144484 (https://phabricator.wikimedia.org/T393872) [12:25:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [12:25:58] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [12:26:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T392806)', diff saved to https://phabricator.wikimedia.org/P75904 and previous config saved to /var/cache/conftool/dbconfig/20250512-122600-fceratto.json [12:26:19] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:26:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T392806)', diff saved to https://phabricator.wikimedia.org/P75905 and previous config saved to /var/cache/conftool/dbconfig/20250512-122626-fceratto.json [12:29:36] (03PS1) 10Slyngshede: P:idm enable django-vite [puppet] - 10https://gerrit.wikimedia.org/r/1144550 (https://phabricator.wikimedia.org/T391443) [12:30:06] (03CR) 10MSantos: [C:03+1] "Great!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143127 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan) [12:30:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10811243 (10cmooney) [12:31:23] (03CR) 10Bartosz Wójtowicz: "It's very nice to see the ml-services deployments in action! 😊 Left 1 small question from me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [12:32:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T392806)', diff saved to https://phabricator.wikimedia.org/P75906 and previous config saved to /var/cache/conftool/dbconfig/20250512-123211-fceratto.json [12:32:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10811259 (10cmooney) [12:32:24] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5525/co" [puppet] - 10https://gerrit.wikimedia.org/r/1144550 (https://phabricator.wikimedia.org/T391443) (owner: 10Slyngshede) [12:32:52] (03PS1) 10Filippo Giunchedi: zuul: disable statsd_exporter relaying to graphite [puppet] - 10https://gerrit.wikimedia.org/r/1144553 (https://phabricator.wikimedia.org/T228380) [12:32:53] (03PS1) 10Filippo Giunchedi: airflow: disable statsd_exporter relaying to graphite [puppet] - 10https://gerrit.wikimedia.org/r/1144554 (https://phabricator.wikimedia.org/T228380) [12:32:55] (03PS1) 10Filippo Giunchedi: graphite: remove access to port 2003 tcp/udp [puppet] - 10https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) [12:32:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:34:11] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5526/console" [puppet] - 10https://gerrit.wikimedia.org/r/1144550 (https://phabricator.wikimedia.org/T391443) 
(owner: 10Slyngshede) [12:34:18] (03CR) 10Gergő Tisza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140725 (https://phabricator.wikimedia.org/T362324) (owner: 10Máté Szabó) [12:35:00] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10811287 (10cmooney) [12:35:13] (03CR) 10Gergő Tisza: [C:03+2] Set wgPHPSessionHandling to 'warn' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140725 (https://phabricator.wikimedia.org/T362324) (owner: 10Máté Szabó) [12:35:59] (03CR) 10Filippo Giunchedi: "Lucas, heads up FYI once this is merged the wmde analytics script nc connections will start failing" [puppet] - 10https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [12:36:00] (03Merged) 10jenkins-bot: Set wgPHPSessionHandling to 'warn' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140725 (https://phabricator.wikimedia.org/T362324) (owner: 10Máté Szabó) [12:36:23] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1143962|Get rid of ancient session_name call (T124371)]], [[gerrit:1144495|Do not use $_SESSION (T29887 T124371)]], [[gerrit:1140725|Set wgPHPSessionHandling to 'warn' (T362324)]] [12:36:33] T124371: Clean up usage of $_SESSION in WMF-deployed extensions - https://phabricator.wikimedia.org/T124371 [12:36:33] T29887: Replying to a thread doesn't work the first time - https://phabricator.wikimedia.org/T29887 [12:36:35] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [12:39:46] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-swift.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:40:02] (03PS1) 10Ayounsi: PuppetImport and offline scripts: delete cable 
before interface [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144556 (https://phabricator.wikimedia.org/T393188) [12:40:35] (03CR) 10Kamila Součková: mw::maintenance: move updateMenteeData to upper level job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [12:40:58] (03CR) 10DCausse: [C:04-2] "lgtm, thanks! marking as -2 for now waiting for the new code to land in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144484 (https://phabricator.wikimedia.org/T393872) (owner: 10SD0001) [12:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [12:41:51] (03CR) 10Clément Goubert: [C:03+1] "This is overriden by `global-{eqiad,codfw}.yaml`, meaning there is no `mcrouter` deployment on `mw-cron`." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143517 (https://phabricator.wikimedia.org/T341555) (owner: 10Effie Mouzeli) [12:42:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:43:31] (03CR) 10Clément Goubert: [C:03+1] cronjobs: update to 3.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074168 (owner: 10Effie Mouzeli) [12:44:29] (03CR) 10Clément Goubert: [C:03+1] mw::maintenance: migrate recountCategories job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144500 (https://phabricator.wikimedia.org/T388533) (owner: 10Hnowlan) [12:45:05] Emperor: FYI in case it wasn't already mentioned, the above alert for a puppet cert expiring: https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire (probably replaced by pki?) 
[12:45:07] (03CR) 10Clément Goubert: [C:03+1] mw::maintenance: migrate initSiteStats cron to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144517 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [12:45:25] (03CR) 10Kamila Součková: [C:03+1] mw::maintenance: migrate refreshLinkRecommendations s1 shard to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143528 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [12:46:56] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:47:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P75907 and previous config saved to /var/cache/conftool/dbconfig/20250512-124718-fceratto.json [12:48:26] (03CR) 10Clément Goubert: "I think using the compact cron syntax is more legible for when the job runs, but up to you." 
[puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [12:48:35] (03CR) 10Volans: [C:03+1] "LGTM, can be tested on netbox-next" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144556 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [12:50:09] 06SRE, 10Observability-Metrics: Every Grafana dashboard generated by Pyrra contains two panels displaying misleading data - https://phabricator.wikimedia.org/T393797#10811350 (10elukey) [12:51:53] (03PS1) 10Volans: setup.py: include the graphql query files [software/homer] - 10https://gerrit.wikimedia.org/r/1144561 [12:52:41] !log tgr@deploy1003 tgr, mszabo: Backport for [[gerrit:1143962|Get rid of ancient session_name call (T124371)]], [[gerrit:1144495|Do not use $_SESSION (T29887 T124371)]], [[gerrit:1140725|Set wgPHPSessionHandling to 'warn' (T362324)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:52:46] T124371: Clean up usage of $_SESSION in WMF-deployed extensions - https://phabricator.wikimedia.org/T124371 [12:52:46] T29887: Replying to a thread doesn't work the first time - https://phabricator.wikimedia.org/T29887 [12:52:47] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [12:53:18] (03PS1) 10Kamila Součková: mw-cron/UpdatePeriodicMetrics-per-wiki: fix dblist [puppet] - 10https://gerrit.wikimedia.org/r/1144562 (https://phabricator.wikimedia.org/T388542) [12:53:38] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1144562 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [12:53:46] (03CR) 10Majavah: [C:03+1] sssd: increase internal timeouts for be, pam, sudo [puppet] - 10https://gerrit.wikimedia.org/r/1143963 (https://phabricator.wikimedia.org/T393732) (owner: 10Andrew Bogott) [12:54:49] (03CR) 10Andrew Bogott: [C:03+2] "noting that the pcc failure is largely meaningless since it ran on 
prod hosts where this isn't applied." [puppet] - 10https://gerrit.wikimedia.org/r/1143963 (https://phabricator.wikimedia.org/T393732) (owner: 10Andrew Bogott) [12:57:18] volans: Hm, I'm a bit confused. If I use openssl to talk to e.g. thanos-fe1001 I get back a Cert dated from May 7 2025 to Jun 4 2025, issuer Issuer: C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = discovery [12:57:46] And that's a cert with CN = thanos-swift.discovery.wmnet [12:58:14] yes I guess obsolete, but better safe than sorry ;) [12:58:16] volans: but the alert says 'Puppet CA certificate thanos-swift.discovery.wmnet will expire in 6d 23h 40m 9s' [12:58:17] (03CR) 10Ayounsi: [C:03+1] setup.py: include the graphql query files [software/homer] - 10https://gerrit.wikimedia.org/r/1144561 (owner: 10Volans) [12:58:35] volans: yes it can be deleted, we are not using it anymore [12:58:38] volans: so I should just silence the alert for a week? [12:58:49] or is there some bit of legacy guff that needs removing? [12:58:51] Emperor: nono we can drop directly the cert from puppet 5's CA [12:59:04] ^^^ [12:59:08] !log tgr@deploy1003 tgr, mszabo: Continuing with sync [12:59:55] ah, cool, please do so then :) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window / Backport Party!Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250512T1300). [13:00:05] Tran and flanoz: A patch you scheduled for UTC afternoon backport window / Backport Party!Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:15] 👋 [13:00:20] I can deploy! 
[13:00:30] I'll deploy one more patch at the end [13:00:38] I'm here with neslihan (flanoz) [13:01:13] new nick for flanoz^ [13:01:14] hi there! If a UI interface is more your style, remember SpiderPig is already available for backport deployments! https://spiderpig.wikimedia.org/ [13:01:16] the previous one took 50 min and is still not done, but probably more to do with bad luck than with spiderpig [13:01:27] yes I am around :) [13:01:50] !log `puppet ca destroy thanos.discovery.wmnet` on puppetmaster1001 - old cert not used anymore [13:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:53] cc: Emperor, volans --^ [13:01:57] thanks luca [13:02:19] tgr_: ack [13:02:23] in case someone wants something to deploy, T393836 could use a backport. unfortunately I don't have time to babysit that at the moment [13:02:24] T393836: Creating accounts on votewiki results in error, does not send email, but is created anyway - https://phabricator.wikimedia.org/T393836 [13:02:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P75908 and previous config saved to /var/cache/conftool/dbconfig/20250512-130225-fceratto.json [13:03:01] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1144562 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [13:03:27] my patch should be testable on testwiki if someone can grant me election admin rights otherwise it can be shipped to prod 🙈 and should theoretically only affect votewiki, where I can test it as I currently have rights. 
[13:03:46] I can also deploy myself but would be happy to not be responsible for that [13:03:50] Tran: we can test it on votewiki, mwdebug should work there [13:03:57] oh perfect TIL [13:04:01] but right now we’re waiting for the previous deployment to finish anyway [13:04:12] though I guess we could merge your backport already [13:04:17] let me see how long CI usually takes there [13:04:35] 👍 thank you! [13:04:36] elukey: thanks :) [13:04:46] RESOLVED: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-swift.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:05:04] ca. 10 minutes (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1143171), then let’s kick it off already [13:05:21] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144481 (https://phabricator.wikimedia.org/T393790) (owner: 10STran) [13:05:26] (03PS4) 10Vgutierrez: varnish: Allow /beacon/v2/event to hit origin servers [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411) [13:05:26] (03PS4) 10Vgutierrez: trafficserver: Send /beacon/v2/event to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411) [13:05:26] (03PS15) 10Vgutierrez: varnish: Issue and handle WMF-Uniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) [13:05:48] (03CR) 10Vgutierrez: varnish: Allow /beacon/v2/event to hit origin servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:06:30] Our change is not needed to be tested https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1141852 [13:06:42] (03PS2) 10Ayounsi: PuppetImport and offline 
scripts: delete cable before interface [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144556 (https://phabricator.wikimedia.org/T393188) [13:07:15] (03CR) 10Volans: [C:03+2] setup.py: include the graphql query files [software/homer] - 10https://gerrit.wikimedia.org/r/1144561 (owner: 10Volans) [13:07:46] (03CR) 10Ayounsi: "Tested on netbox-next through offline device (found an issue fixed in PS2)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144556 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [13:08:27] (03CR) 10Lucas Werkmeister (WMDE): "I feel like it’s a bit early to add this config change… AFAICT the feature flag doesn’t exist yet (not in Wikibase master, and also Id2e03" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141852 (https://phabricator.wikimedia.org/T388685) (owner: 10Neslihan Turan) [13:08:28] (03CR) 10Peter Fischer: [C:03+2] Search update pipeline: 504 handling, weighted tags rename [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135019 (https://phabricator.wikimedia.org/T389053) (owner: 10Peter Fischer) [13:08:33] (03CR) 10Volans: [C:03+1] "Makes sense, thx" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144556 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [13:08:36] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1143962|Get rid of ancient session_name call (T124371)]], [[gerrit:1144495|Do not use $_SESSION (T29887 T124371)]], [[gerrit:1140725|Set wgPHPSessionHandling to 'warn' (T362324)]] (duration: 32m 12s) [13:08:41] T124371: Clean up usage of $_SESSION in WMF-deployed extensions - https://phabricator.wikimedia.org/T124371 [13:08:42] T29887: Replying to a thread doesn't work the first time - https://phabricator.wikimedia.org/T29887 [13:08:42] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [13:08:47] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: 
POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 958559840 and 92 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:08:56] (03CR) 10Ayounsi: [C:03+2] PuppetImport and offline scripts: delete cable before interface [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144556 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [13:09:03] tgr_: do you want to deploy anything else or can I take over? [13:09:08] (other than the change you mentioned you’d do at the end ^^) [13:09:47] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 8648 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:09:48] Lucas_WMDE: no, I'll wait for you with the rest [13:10:00] ok :) [13:10:07] (also good luck, the previous scap took 60 minutes) [13:10:12] :( [13:10:17] (03Merged) 10jenkins-bot: Search update pipeline: 504 handling, weighted tags rename [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135019 (https://phabricator.wikimedia.org/T389053) (owner: 10Peter Fischer) [13:10:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144481 (https://phabricator.wikimedia.org/T393790) (owner: 10STran) [13:10:48] (03Merged) 10jenkins-bot: PuppetImport and offline scripts: delete cable before interface [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144556 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [13:10:49] (03PS1) 10Klausman: Revert "thanos/swift: at pseudo secrets for mint_ro" [labs/private] - 10https://gerrit.wikimedia.org/r/1144569 [13:11:08] tgr_: I am pretty sure it took long because that was the first of the day [13:11:34] and due to the base image being rebuild over the week-end, that leads to a full image rebuild and deploy which indeed takes age [13:11:37] (03CR) 10Klausman: [V:03+2 
C:03+2] Revert "thanos/swift: at pseudo secrets for mint_ro" [labs/private] - 10https://gerrit.wikimedia.org/r/1144569 (owner: 10Klausman) [13:12:09] hashar: gergo did a few deploys one hour ago [13:12:29] sorry, I misread, you're talking about the same :) [13:12:33] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [13:12:43] thanks for rolling out those CI output improvements by Bartosz. Very nice! [13:12:50] also gerrit got stuck [13:12:51] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [13:13:02] ref T393847 [13:13:03] T393847: Improve brevity of Jenkins console output - https://phabricator.wikimedia.org/T393847 [13:13:19] also also spiderpig sometimes notifies you that you need to press a button and sometimes not, which is confusing [13:13:29] argh [13:13:34] that is already 3 different tasks :] [13:13:58] the gerrit one has been around forever [13:14:13] sometimes it gets stuck in Ready to Submit [13:14:37] !log pfischer@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:14:46] !log pfischer@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:14:56] maybe scap backport should detect that, but I imagine it's not trivial as it can be a legitimate state when there are interdependencies [13:15:14] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [13:15:43] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [13:15:50] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:16:33] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:17:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 
(T392806)', diff saved to https://phabricator.wikimedia.org/P75909 and previous config saved to /var/cache/conftool/dbconfig/20250512-131731-fceratto.json [13:17:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [13:17:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T392806)', diff saved to https://phabricator.wikimedia.org/P75910 and previous config saved to /var/cache/conftool/dbconfig/20250512-131756-fceratto.json [13:18:03] (03Merged) 10jenkins-bot: htmlform: Fix rendering contents for cloner fields [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144481 (https://phabricator.wikimedia.org/T393790) (owner: 10STran) [13:18:19] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1144481|htmlform: Fix rendering contents for cloner fields (T393790)]] [13:18:22] T393790: SecurePoll can no longer add questions to new polls - https://phabricator.wikimedia.org/T393790 [13:21:48] (03PS3) 10Vgutierrez: liberica: Add katran config settings [puppet] - 10https://gerrit.wikimedia.org/r/1113961 (https://phabricator.wikimedia.org/T380450) [13:22:45] (03CR) 10JMeybohm: [C:03+2] Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127087 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [13:22:54] !log lucaswerkmeister-wmde@deploy1003 stran, lucaswerkmeister-wmde: Backport for [[gerrit:1144481|htmlform: Fix rendering contents for cloner fields (T393790)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:23:21] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10811562 (10Jhancock.wm) @elukey do we still need this ticket open for testing? 
[13:23:31] Tran: please test using WikimediaDebug :) [13:23:36] (03CR) 10Ssingh: [C:03+1] trafficserver: Allow splitting the cache by HTTP header content [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:23:47] ack, please hold [13:23:57] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudlb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T392686#10811564 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [13:24:02] (03CR) 10Vgutierrez: [C:03+2] trafficserver: Allow splitting the cache by HTTP header content [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:24:52] Krinkle: thanks to matmarex indeed :] [13:25:10] It's good to go 🎉 [13:25:15] !log lucaswerkmeister-wmde@deploy1003 stran, lucaswerkmeister-wmde: Continuing with sync [13:25:18] \o/ thanks for testing! [13:25:26] thanks for deploying! [13:25:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T392806)', diff saved to https://phabricator.wikimedia.org/P75911 and previous config saved to /var/cache/conftool/dbconfig/20250512-132552-fceratto.json [13:26:21] jnuche: FYI the browser notifications from spiderpig (T392487) seem to be a bit flaky :/ [13:26:21] T392487: Add browser notification when deployment is awaiting user interaction - https://phabricator.wikimedia.org/T392487 [13:26:29] tgr_: mentioned above that it “sometimes notifies you that you … and sometimes not, which is confusing” [13:26:36] and that seems to match my experience as well [13:26:52] maybe cause they poll every X minutes? [13:26:56] I think I got a few notifications but not as many as I should have [13:27:06] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudlb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T392686#10811587 (10cmooney) >>!
In T392686#10786597, @Andrew wrote: > + @cmooney because I bet he can fix this in 5 seconds Yeah manually delete... [13:27:34] (03CR) 10Kamila Součková: [C:03+2] mw-cron/UpdatePeriodicMetrics-per-wiki: fix dblist [puppet] - 10https://gerrit.wikimedia.org/r/1144562 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [13:28:19] (03PS1) 10Andrew Bogott: cloud-vps sssd.conf: increase timeout for nss section [puppet] - 10https://gerrit.wikimedia.org/r/1144572 (https://phabricator.wikimedia.org/T393732) [13:28:38] 10ops-codfw, 10ops-eqiad, 06SRE, 10SRE-swift-storage, and 3 others: Supermicro's Config J hot swap behavior - https://phabricator.wikimedia.org/T383903#10811597 (10Jhancock.wm) @MatthewVernon is this ticket close-able? or is there still testing going on here? [13:29:02] flanoz, joelyrookewmde: I left a comment for you on that config change btw [13:29:03] (03CR) 10Majavah: [C:03+1] cloud-vps sssd.conf: increase timeout for nss section [puppet] - 10https://gerrit.wikimedia.org/r/1144572 (https://phabricator.wikimedia.org/T393732) (owner: 10Andrew Bogott) [13:29:12] Lucas_WMDE, tgr_: SpiderPig won't notify you about the first prompt to confirm you want to go ahead with the backport. It should notify you of any other prompts as long as you are the one who kicked off the backport [13:29:20] !log pfischer@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:29:28] !log pfischer@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:29:39] do you recall which prompts didn't get a notification for you? 
[13:31:25] I got a notification for this: T393885 [13:31:25] T393885: SpiderPig fails to show warning for Depends-On confusion in summary view - https://phabricator.wikimedia.org/T393885 [13:31:39] but then no notification for landing on the testservers, I think [13:31:56] jnuche: I think I didn’t get one for the change being ready on WikimediaDebug just now, for instance [13:31:57] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps sssd.conf: increase timeout for nss section [puppet] - 10https://gerrit.wikimedia.org/r/1144572 (https://phabricator.wikimedia.org/T393732) (owner: 10Andrew Bogott) [13:33:10] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1144481|htmlform: Fix rendering contents for cloner fields (T393790)]] (duration: 14m 50s) [13:33:13] T393790: SecurePoll can no longer add questions to new polls - https://phabricator.wikimedia.org/T393790 [13:33:49] (03PS6) 10Btullis: team-search-platform: Add alert for wdqs-categories lag [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking) [13:33:58] @Lucas_WMDE thanks! We feel like publish our change up before it is in use (so we intentionally keep the flag false) as we need to get Translate Wiki changes ready before making the feature available for pilot wikis. [13:34:04] (03CR) 10CI reject: [V:04-1] Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127087 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [13:34:52] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [13:34:53] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [13:35:07] Lucas_WMDE, tgr_: next time that happens to you, can you file a task with the details? The interaction/prompt that failed to notify and which browser are you using. 
Also please note the notifications will only work from the browser where you started the backport, and will stop working if you clear the browser's state [13:35:19] flanoz: okay, but what’s the benefit of adding it to the production config at the moment? [13:35:31] (03CR) 10Btullis: team-search-platform: Add alert for wdqs-categories lag (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking) [13:35:40] I assume the flag should default to false in Wikibase anyway [13:35:57] so the production change should make no difference yet [13:36:08] jnuche: I'll test it later today. Is there an expected delay between scap showing the prompt and the notification happening? [13:36:30] nope, should be immediate [13:37:13] (03PS1) 10Alexandros Kosiaris: [DNM]: Add mw-wikifunctions-ro to deployment server listeners [puppet] - 10https://gerrit.wikimedia.org/r/1144577 (https://phabricator.wikimedia.org/T389375) [13:37:27] 06SRE, 10Observability-Metrics: Set a predefined time window in Pyrra's configuration to measure SLOs with - https://phabricator.wikimedia.org/T393796#10811682 (10elukey) My understanding is that Pyrra (and other tools like Sloth) assume a rolling time window for the SLO. This means that if we select 12 weeks... [13:39:07] Lucas_WMDE so do you suggest using the flag anyways in Wikibase changes without creating the setting and there will be no issues as it will already be false. [13:39:43] flanoz: yes, I think so [13:39:56] there should be no need to add it to the production config until you want to turn it on somewhere (beta, pilot wikis, whatever ^^) [13:40:42] and I’m slightly worried about adding a flag to the production config before it even exists in the code – e.g. 
the Wikibase flag might get renamed during code review, and then the production config ends up with an unused variable that doesn’t get removed for years because nobody notices it ^^ [13:41:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P75912 and previous config saved to /var/cache/conftool/dbconfig/20250512-134100-fceratto.json [13:42:58] Lucas_WMDE thanks! This makes sense. So I will put it on schedule when we need to turn it on for pilot wikis then. [13:43:05] alright 👍 [13:43:17] tgr_: I think you can go ahead with your deployment then [13:43:38] (03PS1) 10AOkoth: deployment: add miscweb aux deploy user [puppet] - 10https://gerrit.wikimedia.org/r/1144578 (https://phabricator.wikimedia.org/T350794) [13:44:15] thanks Lucas_WMDE [13:44:30] 10ops-codfw, 10ops-eqiad, 06SRE, 10SRE-swift-storage, and 3 others: Supermicro's Config J hot swap behavior - https://phabricator.wikimedia.org/T383903#10811719 (10elukey) 05Open→03Resolved a:03elukey We can definitely close it thanks! 
[13:44:44] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1144577 (https://phabricator.wikimedia.org/T389375) (owner: 10Alexandros Kosiaris) [13:44:54] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10811723 (10elukey) 05Open→03Resolved a:03elukey [13:45:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144532 (https://phabricator.wikimedia.org/T393038) (owner: 10Gergő Tisza) [13:45:28] !log hashar@deploy1003 Started deploy [integration/docroot@21bebf5]: build: Updating mediawiki/mediawiki-codesniffer to 47.0.0 [13:45:40] !log hashar@deploy1003 Finished deploy [integration/docroot@21bebf5]: build: Updating mediawiki/mediawiki-codesniffer to 47.0.0 (duration: 00m 11s) [13:45:50] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10811727 (10elukey) 05Open→03Resolved We decided to keep going with the new controller and retro-fit all the ms-be Supermicro n... 
[13:47:07] FIRING: [3x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:47:34] (03PS2) 10Bking: Add cirrussearch1122 plus row A hosts as masters-eligible [puppet] - 10https://gerrit.wikimedia.org/r/1143897 (https://phabricator.wikimedia.org/T388610) [13:47:36] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143897 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:48:03] 10ops-codfw, 10ops-eqiad, 06SRE, 10SRE-swift-storage, and 3 others: Supermicro's Config J hot swap behavior - https://phabricator.wikimedia.org/T383903#10811736 (10elukey) Summary - we were able to solve the cabling issue, and after a long review (see T384003 and T391854) we decided to move to a differ... [13:48:09] rolling/14 [13:48:21] (03CR) 10CI reject: [V:04-1] Add cirrussearch1122 plus row A hosts as masters-eligible [puppet] - 10https://gerrit.wikimedia.org/r/1143897 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:48:24] err what :D [13:48:55] (03CR) 10Elukey: [C:03+2] airflow: move to mesh:configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144454 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:49:11] (03PS3) 10Bking: Add cirrussearch1122 plus row A hosts as masters-eligible [puppet] - 10https://gerrit.wikimedia.org/r/1143897 (https://phabricator.wikimedia.org/T388610) [13:50:17] (03CR) 10CI reject: [V:04-1] Add cirrussearch1122 plus row A hosts as masters-eligible [puppet] - 10https://gerrit.wikimedia.org/r/1143897 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:51:41] !log fabfur@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7001.magru.wmnet with reason: Testing in progress [13:52:03] (03PS4) 10Bking: Add cirrussearch1122 plus row A hosts as masters-eligible [puppet] - 10https://gerrit.wikimedia.org/r/1143897 
(https://phabricator.wikimedia.org/T388610) [13:52:17] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143897 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:52:20] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [13:54:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:54:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:54:57] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.10.1 [software/homer] - 10https://gerrit.wikimedia.org/r/1144579 [13:55:05] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v0.10.1 [software/homer] - 10https://gerrit.wikimedia.org/r/1144579 (owner: 10Volans) [13:56:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P75913 and previous config saved to /var/cache/conftool/dbconfig/20250512-135607-fceratto.json [13:59:26] (03Merged) 10jenkins-bot: Improve session logging [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144532 (https://phabricator.wikimedia.org/T393038) (owner: 10Gergő Tisza) [13:59:41] (03PS2) 10Alexandros Kosiaris: [DNM]: Add mw-wikifunctions-ro to deployment server listeners [puppet] - 10https://gerrit.wikimedia.org/r/1144577 (https://phabricator.wikimedia.org/T389375) [13:59:41] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1144532|Improve session logging (T393038)]] [13:59:43] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1144577 (https://phabricator.wikimedia.org/T389375) (owner: 10Alexandros Kosiaris) [13:59:44] T393038: Improve MediaWiki session logging - https://phabricator.wikimedia.org/T393038 [14:01:25] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:04:02] !log jhancock@cumin2002 END (PASS) 
- Cookbook sre.dns.netbox (exit_code=0) [14:04:34] !log tgr@deploy1003 tgr: Backport for [[gerrit:1144532|Improve session logging (T393038)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:04:39] FIRING: [9x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2057-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:06:28] (03PS1) 10Hnowlan: trafficserver: route a smaller subset of enwiki pcs pages without restbase [puppet] - 10https://gerrit.wikimedia.org/r/1144581 (https://phabricator.wikimedia.org/T393591) [14:07:09] (03PS1) 10Dreamy Jazz: Revert^2 "mw::maintenance: migrate mediamoderation-hourlyScan to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1144582 [14:07:18] (03PS2) 10Dreamy Jazz: Revert^2 "mw::maintenance: migrate mediamoderation-hourlyScan to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1144582 (https://phabricator.wikimedia.org/T393236) [14:07:38] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.10.1 [software/homer] - 10https://gerrit.wikimedia.org/r/1144579 (owner: 10Volans) [14:07:49] (03PS3) 10Dreamy Jazz: Revert^2 "mw::maintenance: migrate mediamoderation-hourlyScan to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1144582 (https://phabricator.wikimedia.org/T393236) [14:08:27] (03CR) 10Herron: [C:03+1] "🧹🧼" [puppet] - 10https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [14:09:44] FIRING: [9x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2057-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress 
- https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:10:11] (03CR) 10Brouberol: [C:03+1] Add cirrussearch1122 plus row A hosts as masters-eligible [puppet] - 10https://gerrit.wikimedia.org/r/1143897 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:10:22] !log tgr@deploy1003 tgr: Continuing with sync [14:11:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T392806)', diff saved to https://phabricator.wikimedia.org/P75914 and previous config saved to /var/cache/conftool/dbconfig/20250512-141114-fceratto.json [14:11:30] (03PS1) 10Btullis: ceph: Remove extraneous logging configuration statement [puppet] - 10https://gerrit.wikimedia.org/r/1144583 (https://phabricator.wikimedia.org/T384322) [14:11:33] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [14:11:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T392806)', diff saved to https://phabricator.wikimedia.org/P75915 and previous config saved to /var/cache/conftool/dbconfig/20250512-141139-fceratto.json [14:12:26] (03CR) 10Btullis: [C:03+1] Add cirrussearch1122 plus row A hosts as masters-eligible [puppet] - 10https://gerrit.wikimedia.org/r/1143897 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:15:07] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [14:15:10] (03CR) 10Herron: [C:03+2] logs-api: add write/delete acl via htgroup [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron) [14:15:24] (03PS1) 10Volans: Release v0.10.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1144585 [14:16:07] (03CR) 10Ayounsi: [C:03+1] Release v0.10.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1144585 (owner: 10Volans) [14:16:29] (03CR) 10Volans: [C:03+2] Release v0.10.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1144585 (owner: 10Volans) [14:17:05] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1144532|Improve session logging (T393038)]] (duration: 17m 24s) [14:17:08] T393038: Improve MediaWiki session logging - https://phabricator.wikimedia.org/T393038 [14:17:16] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate recountCategories job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144500 (https://phabricator.wikimedia.org/T388533) (owner: 10Hnowlan) [14:18:47] (03CR) 10Bking: [C:03+2] Add cirrussearch1122 plus row A hosts as masters-eligible [puppet] - 10https://gerrit.wikimedia.org/r/1143897 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:19:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T392806)', diff saved to https://phabricator.wikimedia.org/P75916 and previous config saved to /var/cache/conftool/dbconfig/20250512-141933-fceratto.json [14:19:39] FIRING: [14x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2057-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:21:51] !log volans@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin1003.eqiad.wmnet with reason: Release v0.10.1 - volans@cumin1003 [14:22:35] (03PS2) 10Krinkle: tests: Remove one-off test-only getDblistsUsedInSettings() and isWikiFamily() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141517 [14:22:41] !log volans@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin1003.eqiad.wmnet with reason: Release v0.10.1 - volans@cumin1003 [14:23:16] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1068 to cirrussearch1068 [14:23:21] 10SRE-swift-storage, 06Commons, 10media-backups: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049#10811859 (10jcrespo) >>! In T393049#10811102, @PMG wrote: > @jcrespo - I tried to reupload the... [14:23:28] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:23:40] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:23:47] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:24:27] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate initSiteStats cron to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144517 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [14:24:39] FIRING: [15x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2057-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:25:29] inflatador: am I okay to merge your changes as regards cirrussearch1122? 
[14:27:18] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1068 to cirrussearch1068 - bking@cumin2002" [14:27:26] hnowlan yes, please go ahead...sorry ;) [14:27:38] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1068 to cirrussearch1068 - bking@cumin2002" [14:27:39] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:27:39] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1068 on all recursors [14:27:42] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1068 on all recursors [14:27:43] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1068 [14:27:46] 10SRE-swift-storage, 06Commons, 10media-backups: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049#10811876 (10PMG) @jcrespo thank you very much. Fixing this will make my life easier because it... 
[14:29:39] FIRING: [15x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2057-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:30:49] bking@cumin2002 rename (PID 1794500) is awaiting input [14:30:50] (03CR) 10Peter Fischer: [C:03+2] CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135010 (https://phabricator.wikimedia.org/T389053) (owner: 10Peter Fischer) [14:31:53] (03PS3) 10Krinkle: multiversion: Update readDbListFile() calls from alias to WmfConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141518 [14:32:00] (03Merged) 10jenkins-bot: CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135010 (https://phabricator.wikimedia.org/T389053) (owner: 10Peter Fischer) [14:32:12] (03PS2) 10Krinkle: tests: Replace array_keys(wikiversions.json) with all.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141521 [14:32:15] (03PS3) 10Krinkle: multiversion: Move remaining dblist helper to WmfConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141522 [14:34:25] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 06Release-Engineering-Team, 10vm-requests: codfw: 1VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10811905 (10joanna_borun) p:05Triage→03Medium [14:34:39] FIRING: [17x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2057-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:34:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P75917 and previous config saved to /var/cache/conftool/dbconfig/20250512-143441-fceratto.json [14:34:57] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:35:15] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:36:23] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:36:30] (03PS1) 10Elukey: Add prometheus::instance_defaults to deployment-prep's common settings [puppet] - 10https://gerrit.wikimedia.org/r/1144589 (https://phabricator.wikimedia.org/T393866) [14:36:57] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:39:17] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.803 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:39:30] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [14:39:34] !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp7001.magru.wmnet [14:39:35] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7001.magru.wmnet [14:39:49] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53942 bytes in 3.209 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:40:37] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1068 [14:41:17] !log bking@cumin2002 END (PASS) - 
Cookbook sre.hosts.rename (exit_code=0) from elastic1068 to cirrussearch1068 [14:42:26] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1068.eqiad.wmnet with OS bullseye [14:42:30] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1068 [14:42:30] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1068 [14:42:45] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10811953 (10VPuffetMichel) @eevans: yes, please proceed. Thanks! [14:44:23] !log update helm311 and helm317 on deploy2002 - T387548 [14:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:27] T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548 [14:44:39] FIRING: [19x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2057-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:45:35] (03CR) 10Filippo Giunchedi: [C:03+1] "Fair enough, ok! 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1144589 (https://phabricator.wikimedia.org/T393866) (owner: 10Elukey) [14:48:43] !log dancy@deploy1003 Installing scap version "4.163.0" for 2 host(s) [14:49:39] FIRING: [18x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2059-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:49:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P75918 and previous config saved to /var/cache/conftool/dbconfig/20250512-144948-fceratto.json [14:50:32] !log dancy@deploy1003 Installation of scap version "4.163.0" completed for 2 hosts [14:51:57] (03CR) 10Scott French: mw::maintenance: move refreshLinkRecommendations job to shared object (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [14:52:22] (03CR) 10JMeybohm: [C:03+2] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127087 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [14:53:28] (03PS2) 10Gkyziridis: ml-inference-services: edit-check experirmental prod deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) [14:53:59] (03Abandoned) 10Bking: elastic: decom 6 codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1125243 (https://phabricator.wikimedia.org/T380529) (owner: 10Ryan Kemper) [14:54:31] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on 60 hosts with reason: surpress CirrusSearchNodeIndexingNotIncreasing alerts with CODFW is depooled [14:57:07] 
!log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2001.codfw.wmnet [14:57:11] !log bking@cumin2002 START - Cookbook sre.hosts.remove-downtime for 60 hosts [14:57:40] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2001.codfw.wmnet [14:57:49] (03CR) 10Lucas Werkmeister (WMDE): "AFAICT, that shouldn’t cause any immediate issues (PHP `exec()` doesn’t produce an error if the shell command exists nonzero), though I gu" [puppet] - 10https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [14:57:58] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 60 hosts [14:58:41] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2001.codfw.wmnet [14:58:42] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2001.codfw.wmnet [14:59:01] (03CR) 10Gkyziridis: ml-inference-services: edit-check experirmental prod deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [14:59:30] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1068.eqiad.wmnet with reason: host reimage [15:00:46] (03PS1) 10Ayounsi: Interface: add validator for child + cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144593 (https://phabricator.wikimedia.org/T393188) [15:02:53] (03CR) 10Ayounsi: Interface: add validator for child + cable (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144593 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [15:03:17] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1068.eqiad.wmnet with reason: host reimage [15:04:55] !log fceratto@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db1201 (T392806)', diff saved to https://phabricator.wikimedia.org/P75919 and previous config saved to /var/cache/conftool/dbconfig/20250512-150454-fceratto.json [15:05:01] (03CR) 10Majavah: "I don't think there is anything deployment-prep specific about this, so maybe set it in `hieradata/cloud.yaml` instead?" [puppet] - 10https://gerrit.wikimedia.org/r/1144589 (https://phabricator.wikimedia.org/T393866) (owner: 10Elukey) [15:05:06] !log upgraded spicerack to v10.2.0 on cumin1002 [15:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:13] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:55] FIRING: [11x] SystemdUnitFailed: prometheus-node-textfile-prometheus-check-admin-ng-pending-changes-aux-k8s-codfw.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:08:34] Hi! I accidentally merged a wmf-config CR w/o scheduling it for a backport window: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1135010. Can I roll this out myself? 
[15:08:53] (03Merged) 10jenkins-bot: Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127087 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [15:09:09] jouncebot: nowandnext [15:09:09] No deployments scheduled for the next 0 hour(s) and 20 minute(s) [15:09:10] In 0 hour(s) and 20 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250512T1530) [15:09:20] pfischer: do you have deployment access? [15:09:20] should be enough time I think [15:09:40] taavi: I should, have deployed before [15:09:59] then yes as long as you're done before the next window [15:10:13] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [15:10:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T392806)', diff saved to https://phabricator.wikimedia.org/P75920 and previous config saved to /var/cache/conftool/dbconfig/20250512-151020-fceratto.json [15:10:34] taavi: Alright, thanks. [15:11:49] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, one suggestion inline as to a slightly different approach but I am happy either way. Thanks!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144593 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [15:12:48] (03CR) 10BryanDavis: "+1 to moving to cloud.yaml so that the other 198 Cloud VPS projects benefit as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/1144589 (https://phabricator.wikimedia.org/T393866) (owner: 10Elukey) [15:12:56] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2001.codfw.wmnet [15:12:57] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2001.codfw.wmnet [15:13:06] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2001.codfw.wmnet [15:13:08] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2001.codfw.wmnet [15:17:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T392806)', diff saved to https://phabricator.wikimedia.org/P75921 and previous config saved to /var/cache/conftool/dbconfig/20250512-151709-fceratto.json [15:17:16] (03PS2) 10Btullis: ceph: Remove extraneous logging configuration statement [puppet] - 10https://gerrit.wikimedia.org/r/1144583 (https://phabricator.wikimedia.org/T384322) [15:18:20] (03PS2) 10Elukey: Add prometheus::instance_defaults to deployment-prep's common settings [puppet] - 10https://gerrit.wikimedia.org/r/1144589 (https://phabricator.wikimedia.org/T393866) [15:18:53] (03PS1) 10Peter Fischer: Revert "CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144595 [15:19:15] (03CR) 10Peter Fischer: [C:03+2] Revert "CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144595 (owner: 10Peter Fischer) [15:19:19] (03CR) 10Elukey: "Done! 
Lemme know if it is ok :)" [puppet] - 10https://gerrit.wikimedia.org/r/1144589 (https://phabricator.wikimedia.org/T393866) (owner: 10Elukey) [15:19:22] (03PS3) 10Elukey: Add prometheus::instance_defaults to cloud's common settings [puppet] - 10https://gerrit.wikimedia.org/r/1144589 (https://phabricator.wikimedia.org/T393866) [15:19:24] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5528/co" [puppet] - 10https://gerrit.wikimedia.org/r/1144583 (https://phabricator.wikimedia.org/T384322) (owner: 10Btullis) [15:19:59] (03Merged) 10jenkins-bot: Revert "CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144595 (owner: 10Peter Fischer) [15:25:54] (03CR) 10BryanDavis: Add prometheus::instance_defaults to cloud's common settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1144589 (https://phabricator.wikimedia.org/T393866) (owner: 10Elukey) [15:27:40] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1068.eqiad.wmnet with OS bullseye [15:30:05] jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250512T1530). 
nyaa~ [15:32:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P75922 and previous config saved to /var/cache/conftool/dbconfig/20250512-153216-fceratto.json [15:33:15] (03CR) 10Peter Fischer: [C:03+1] CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144600 (https://phabricator.wikimedia.org/T389053) (owner: 10Peter Fischer) [15:33:26] (03PS1) 10Filippo Giunchedi: etcd: prometheus already has access to all ports [puppet] - 10https://gerrit.wikimedia.org/r/1144601 (https://phabricator.wikimedia.org/T389170) [15:34:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144600 (https://phabricator.wikimedia.org/T389053) (owner: 10Peter Fischer) [15:34:47] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_ulsfo [15:34:54] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_ulsfo [15:35:22] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1069 to cirrussearch1069 [15:35:38] (03CR) 10CDanis: [C:03+1] chart-renderer: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144461 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [15:35:41] (03CR) 10Filippo Giunchedi: Add prometheus::instance_defaults to cloud's common settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1144589 (https://phabricator.wikimedia.org/T393866) (owner: 10Elukey) [15:35:46] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:36:23] (03CR) 10Filippo Giunchedi: "IMHO good to remove this rule in general, also helps with Ib3a314487604" [puppet] - 
10https://gerrit.wikimedia.org/r/1144601 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [15:37:25] (03PS2) 10CDanis: Switch status page to haproxy metrics [puppet] - 10https://gerrit.wikimedia.org/r/1143868 (https://phabricator.wikimedia.org/T202061) [15:37:46] (03CR) 10CDanis: [C:03+2] move geoip to profile::cache::base [puppet] - 10https://gerrit.wikimedia.org/r/1143123 (owner: 10CDanis) [15:38:36] (03CR) 10CDanis: [C:03+2] Switch status page to haproxy metrics [puppet] - 10https://gerrit.wikimedia.org/r/1143868 (https://phabricator.wikimedia.org/T202061) (owner: 10CDanis) [15:40:05] (03CR) 10Joal: Bump nodemanager heap on the production Hadoop cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1144520 (https://phabricator.wikimedia.org/T393695) (owner: 10Btullis) [15:40:41] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1069 to cirrussearch1069 - bking@cumin2002" [15:41:05] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1069 to cirrussearch1069 - bking@cumin2002" [15:41:05] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:41:05] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1069 on all recursors [15:41:09] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1069 on all recursors [15:41:10] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1069 [15:41:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:42:55] !log bking@cumin2002 END (PASS) 
- Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1069 [15:43:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1069 to cirrussearch1069 [15:44:14] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1069.eqiad.wmnet with OS bullseye [15:44:18] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1069 [15:44:18] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1069 [15:44:54] 06SRE, 10Observability-Metrics: Set a predefined time window in Pyrra's configuration to measure SLOs with - https://phabricator.wikimedia.org/T393796#10812480 (10elukey) About 1, I think that alerting in this case could be summarized in this way: - Pyrra automatically create 18h, 6h, 1:30h, 15m burndown aler... [15:47:01] (03CR) 10DCausse: [C:03+1] CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144600 (https://phabricator.wikimedia.org/T389053) (owner: 10Peter Fischer) [15:47:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P75924 and previous config saved to /var/cache/conftool/dbconfig/20250512-154723-fceratto.json [15:47:49] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1144583 (https://phabricator.wikimedia.org/T384322) (owner: 10Btullis) [15:52:40] (03PS2) 10Ebernhardson: search: cname specific search clusters to the lvs pool [dns] - 10https://gerrit.wikimedia.org/r/1143891 (https://phabricator.wikimedia.org/T143553) [15:52:40] (03PS9) 10Ebernhardson: search: add discovery records for secondary clusters [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) [15:52:53] (03CR) 10Ayounsi: Interface: add validator for child + cable (031 comment) [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/1144593 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [15:55:48] FIRING: PuppetFailure: Puppet has failed on deploy2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:56:07] (03CR) 10Ebernhardson: search: add discovery records for secondary clusters (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [15:56:20] (03CR) 10Cathal Mooney: [C:03+1] Interface: add validator for child + cable (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144593 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [15:56:45] 10SRE-swift-storage, 06Commons, 10media-backups: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049#10812544 (10jcrespo) So I did: ` mwscript importImages.php --wiki=commonswiki --sleep=1 --comm... [15:56:51] 10SRE-swift-storage, 06Commons, 10media-backups: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049#10812545 (10jcrespo) 05Open→03Resolved a:03jcrespo [15:57:53] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery-Search: Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10812549 (10bd808) @bking Could you make some time to look into these failures? https://opensta... 
[15:58:27] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1069.eqiad.wmnet with reason: host reimage [15:58:38] (03CR) 10Ayounsi: Interface: add validator for child + cable (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144593 (https://phabricator.wikimedia.org/T393188) (owner: 10Ayounsi) [15:59:32] (03CR) 10Elukey: [C:03+1] etcd: prometheus already has access to all ports [puppet] - 10https://gerrit.wikimedia.org/r/1144601 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [15:59:46] (03PS2) 10Ayounsi: Interface: add validator for child + non-virtual [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1144593 (https://phabricator.wikimedia.org/T393188) [16:00:48] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Data-Platform-SRE, 06Discovery-Search: Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10812563 (10bking) [16:01:18] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Data-Platform-SRE, 06Discovery-Search: Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10812564 (10bking) Thanks for the ping, @bd808 . I'm not sure when we'll... 
[16:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:02:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T392806)', diff saved to https://phabricator.wikimedia.org/P75925 and previous config saved to /var/cache/conftool/dbconfig/20250512-160230-fceratto.json [16:02:55] FIRING: [11x] SystemdUnitFailed: prometheus-node-textfile-prometheus-check-admin-ng-pending-changes-aux-k8s-codfw.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:03:03] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1069.eqiad.wmnet with reason: host reimage [16:03:37] (03CR) 10Jelto: [C:03+2] helm: remove duplicate alternatives::select entry [puppet] - 10https://gerrit.wikimedia.org/r/1140164 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [16:05:07] !log update helm311 and helm317 on deploy1003 - T387548 [16:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:13] T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548 [16:06:06] (03Abandoned) 10Elukey: Add prometheus::instance_defaults to cloud's common settings [puppet] - 10https://gerrit.wikimedia.org/r/1144589 (https://phabricator.wikimedia.org/T393866) (owner: 10Elukey) [16:12:20] (03PS3) 10Dwisehaupt: frack: update A and PTR records for NAT mappings [dns] - 10https://gerrit.wikimedia.org/r/1140785 (https://phabricator.wikimedia.org/T392843) [16:14:19] (03CR) 10Dwisehaupt: [C:03+2] frack: update A and PTR records for NAT mappings [dns] - 
10https://gerrit.wikimedia.org/r/1140785 (https://phabricator.wikimedia.org/T392843) (owner: 10Dwisehaupt) [16:14:34] !log dwisehaupt@dns1004 START - running authdns-update [16:15:36] !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:15:43] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:16:14] !log dwisehaupt@dns1004 END - running authdns-update [16:16:30] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1070 to cirrussearch1070 [16:16:55] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:17:45] !log update helm311 and helm317 on contint1002 contint2002 - T387548 [16:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:48] T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548 [16:18:44] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [16:19:08] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1069.eqiad.wmnet with OS bullseye [16:20:07] 10SRE-swift-storage, 06Commons, 10media-backups: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049#10812682 (10jcrespo) Deleted the test wikipedia files, too FYI [16:20:41] (03PS1) 10Dwisehaupt: community-crm: Add mx records [dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) [16:21:21] (03CR) 10Dwisehaupt: "This 
should hopefully be the last step to the mail configuration for community-crm" [dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [16:21:22] (03CR) 10CI reject: [V:04-1] community-crm: Add mx records [dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [16:22:07] FIRING: [3x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:22:18] (03CR) 10Ssingh: "You have a CNAME above so you can't have any other records after or in addition to that." [dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [16:22:31] bking@cumin2002 rename (PID 1844915) is awaiting input [16:23:44] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [16:25:08] (03CR) 10Dwisehaupt: community-crm: Add mx records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [16:25:39] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1070 to cirrussearch1070 - bking@cumin2002" [16:26:57] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1070 to cirrussearch1070 - bking@cumin2002" [16:26:58] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [16:26:58] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1070 on all recursors [16:27:02] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1070 on all recursors [16:27:02] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1070 [16:28:28] (03PS7) 10Hnowlan: mw::maintenance: move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) [16:28:31] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1070 [16:28:38] (03CR) 10Hnowlan: "yeah 100%, fixed." [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [16:29:11] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1070 to cirrussearch1070 [16:29:15] (03CR) 10Hnowlan: "Holding on this one for the moment until we hear back from the team" [puppet] - 10https://gerrit.wikimedia.org/r/1140672 (https://phabricator.wikimedia.org/T388537) (owner: 10Hnowlan) [16:29:21] (03CR) 10Clément Goubert: [C:03+1] mw::maintenance: move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [16:29:55] (03CR) 10BCornwall: [C:03+1] search: add discovery records for secondary clusters [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [16:30:40] (03PS3) 10Hnowlan: mw::maintenance: migrate all image suggestions jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140672 (https://phabricator.wikimedia.org/T388537) [16:30:45] (03PS1) 10Elukey: istio: introduce legacy images to backport features [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1144612 (https://phabricator.wikimedia.org/T392886) [16:32:00] 
!log volans@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin1002.eqiad.wmnet with reason: Release v0.10.1 - volans@cumin1003 [16:32:11] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Data-Platform-SRE, 06Discovery-Search: Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10812752 (10taavi) The likely fix for this will be migrating that file d... [16:32:52] !log volans@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin1002.eqiad.wmnet with reason: Release v0.10.1 - volans@cumin1003 [16:33:17] (03CR) 10Federico Ceratto: "I replied with few small changes" [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [16:33:20] !log volans@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet with reason: Release v0.10.1 - volans@cumin1003 [16:34:09] !log volans@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet with reason: Release v0.10.1 - volans@cumin1003 [16:34:22] (03CR) 10Hnowlan: [C:03+2] "Old comment posted in error. No team feedback, proceeding." 
[puppet] - 10https://gerrit.wikimedia.org/r/1140672 (https://phabricator.wikimedia.org/T388537) (owner: 10Hnowlan) [16:34:43] (03CR) 10Kamila Součková: [C:03+1] mw::maintenance: migrate all image suggestions jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140672 (https://phabricator.wikimedia.org/T388537) (owner: 10Hnowlan) [16:35:32] FIRING: ErrorBudgetBurn: search-update-lag codfw - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:35:48] RESOLVED: PuppetFailure: Puppet has failed on deploy2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:35:50] (03CR) 10Kamila Součková: [C:03+1] mw::maintenance: migrate all refreshLinkRecommendations jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143529 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [16:38:26] (03CR) 10Dwisehaupt: "@ssingh@wikimedia.org Thanks. I'm just now finding the comments around the dyna.wm.o entry that point to us needing to use DYNA." 
[dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [16:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:42:09] (03PS2) 10Dwisehaupt: community-crm: Add mx records [dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) [16:42:34] (03CR) 10Ssingh: community-crm: Add mx records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [16:42:44] (03CR) 10CI reject: [V:04-1] community-crm: Add mx records [dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [16:42:54] (03CR) 10Ssingh: community-crm: Add mx records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [16:43:15] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [16:43:32] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [16:45:08] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [16:45:49] (03CR) 10Dwisehaupt: community-crm: Add mx records (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [16:46:08] (03PS3) 10Dwisehaupt: community-crm: Add mx records [dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) [16:46:56] FIRING: [2x] SystemdUnitFailed: 
curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:49:15] (03CR) 10Ssingh: [C:03+1] community-crm: Add mx records [dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [16:52:01] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [16:52:19] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [16:53:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141517 (owner: 10Krinkle) [16:53:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141518 (owner: 10Krinkle) [16:53:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141521 (owner: 10Krinkle) [16:54:07] (03Merged) 10jenkins-bot: tests: Remove one-off test-only getDblistsUsedInSettings() and isWikiFamily() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141517 (owner: 10Krinkle) [16:54:10] (03Merged) 10jenkins-bot: multiversion: Update readDbListFile() calls from alias to WmfConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141518 (owner: 10Krinkle) [16:54:12] (03Merged) 10jenkins-bot: tests: Replace array_keys(wikiversions.json) with all.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141521 (owner: 10Krinkle) [16:54:56] Krinkle: Have you used https://spiderpig.wikimedia.org yet ? 
[16:55:36] (03PS5) 10Scott French: P:mw::maintenance::refreshlinks: migrate s6 to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1143122 (https://phabricator.wikimedia.org/T388530) [16:55:55] Not yet. I prefer more direct control so that I can respond to problems more quickly than I think I can from a web UI. Especially rollbacks, fatal monitor, mwdebug ssh, git logs, etc. [16:56:57] (03CR) 10Hnowlan: [C:03+1] P:mw::maintenance::refreshlinks: migrate s6 to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1143122 (https://phabricator.wikimedia.org/T388530) (owner: 10Scott French) [16:59:45] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1141517|tests: Remove one-off test-only getDblistsUsedInSettings() and isWikiFamily()]], [[gerrit:1141518|multiversion: Update readDbListFile() calls from alias to WmfConfig]], [[gerrit:1141521|tests: Replace array_keys(wikiversions.json) with all.dblist]] [16:59:46] RECOVERY - Restbase root url on restbase1041 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 1.241 second response time https://wikitech.wikimedia.org/wiki/RESTBase [17:00:04] swfrench-wmf: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250512T1700). [17:00:05] ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250512T1700). [17:00:39] o/ [17:00:39] Krinkle: Thanks for your response. I've made some notes. [17:01:38] Krinkle: Regarding `fatal monitors`, I assume that means looking at logstash, logspam-watch, or something of that nature? [17:02:23] Krinkle: can you ping me when you're done with your backport? 
I have some periodic job migrations planned for the infra window :) [17:03:02] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move connections on ssw1-f1-codfw to match normal pattern - https://phabricator.wikimedia.org/T393936 (10cmooney) 03NEW p:05Triage→03Medium [17:03:22] ack, should be done in a few minutes. This is a minor refactor / no-op. I wouldn't mind you rolling out puppet/ or charts changes at the same time. [17:03:35] (03PS1) 10DCausse: cirrus-streaming-updater: alert when 0 tasks are registered [alerts] - 10https://gerrit.wikimedia.org/r/1144617 [17:03:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move connections on ssw1-f1-codfw to match normal pattern - https://phabricator.wikimedia.org/T393936#10812996 (10cmooney) [17:03:58] (03CR) 10Hashar: "Sorry I do not have the context for this change. Zuul does emit a lot of metrics over statsd which we rely on to monitor its state. It loo" [puppet] - 10https://gerrit.wikimedia.org/r/1144553 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [17:04:46] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1141517|tests: Remove one-off test-only getDblistsUsedInSettings() and isWikiFamily()]], [[gerrit:1141518|multiversion: Update readDbListFile() calls from alias to WmfConfig]], [[gerrit:1141521|tests: Replace array_keys(wikiversions.json) with all.dblist]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:07:27] (03CR) 10Ebernhardson: [C:03+2] cirrus-streaming-updater: alert when 0 tasks are registered [alerts] - 10https://gerrit.wikimedia.org/r/1144617 (owner: 10DCausse) [17:07:54] 06SRE: Long-running throttling/timeouts during batch uploads of images to Commons - https://phabricator.wikimedia.org/T393938 (10MichaelDellaBitta) 03NEW [17:08:38] (03CR) 10Scott French: [C:03+2] P:mw::maintenance::refreshlinks: migrate s6 to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1143122 
(https://phabricator.wikimedia.org/T388530) (owner: 10Scott French) [17:08:51] (03Merged) 10jenkins-bot: cirrus-streaming-updater: alert when 0 tasks are registered [alerts] - 10https://gerrit.wikimedia.org/r/1144617 (owner: 10DCausse) [17:09:54] !log krinkle@deploy1003 krinkle: Continuing with sync [17:10:26] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1070.eqiad.wmnet with OS bullseye [17:10:32] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1070 [17:10:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1070 [17:11:56] (03PS1) 10Fabfur: cache: lua lookup experiment [puppet] - 10https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) [17:12:47] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) (owner: 10Fabfur) [17:13:41] (03PS11) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) [17:13:43] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission thanos-fe200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T393870#10813058 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:15:00] (03PS2) 10Fabfur: cache: lua lookup experiment [puppet] - 10https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) [17:15:46] (03CR) 10Dwisehaupt: "@ssingh@wikimedia.org Thanks for the help here. 
I've used this as a chance to install tox and gdnsd to my environment so I can hopefully c" [dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [17:16:07] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) (owner: 10Fabfur) [17:16:31] (03CR) 10Scott French: [C:03+2] P:mw::maint::temporary_accounts: purge_temporary_accounts to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143197 (https://phabricator.wikimedia.org/T385866) (owner: 10Scott French) [17:16:51] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141517|tests: Remove one-off test-only getDblistsUsedInSettings() and isWikiFamily()]], [[gerrit:1141518|multiversion: Update readDbListFile() calls from alias to WmfConfig]], [[gerrit:1141521|tests: Replace array_keys(wikiversions.json) with all.dblist]] (duration: 17m 05s) [17:16:56] swfrench-wmf: done. [17:17:03] Krinkle: ack, thanks! [17:17:24] (03CR) 10Ssingh: [C:03+1] "No worries @dwisehaupt@wikimedia.org, we are always happy to help review this stuff anyway!" 
[dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [17:18:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141522 (owner: 10Krinkle) [17:18:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135529 (https://phabricator.wikimedia.org/T371378) (owner: 10Krinkle) [17:22:04] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [17:23:13] (03PS2) 10Esanders: Enable DiscussionTools visual enhancements everywhere except enwiki & ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133975 (https://phabricator.wikimedia.org/T379264) [17:25:01] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [17:25:10] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [17:28:04] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1070.eqiad.wmnet with reason: host reimage [17:31:47] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1070.eqiad.wmnet with reason: host reimage [17:34:59] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:38:11] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: generate dns recrods for new codfw switches - cmooney@cumin1002" [17:38:29] (03PS1) 10Cathal Mooney: Add INCLUDE files for new IPv6 addresses in use in codfw [dns] - 10https://gerrit.wikimedia.org/r/1144625 
(https://phabricator.wikimedia.org/T382219) [17:38:29] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: generate dns recrods for new codfw switches - cmooney@cumin1002" [17:38:29] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:39:07] (03CR) 10CI reject: [V:04-1] Add INCLUDE files for new IPv6 addresses in use in codfw [dns] - 10https://gerrit.wikimedia.org/r/1144625 (https://phabricator.wikimedia.org/T382219) (owner: 10Cathal Mooney) [17:39:50] (03CR) 10BryanDavis: [C:03+1] [BETA CLUSTER] Close en_rtlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140976 (owner: 10Jforrester) [17:43:32] (03PS2) 10Cathal Mooney: Add INCLUDE files for new IPv6 addresses in use in codfw [dns] - 10https://gerrit.wikimedia.org/r/1144625 (https://phabricator.wikimedia.org/T382219) [17:48:43] PROBLEM - Hadoop NodeManager on an-worker1197 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:49:18] 06SRE: Long-running throttling/timeouts during batch uploads of images to Commons - https://phabricator.wikimedia.org/T393938#10813278 (10Aklapper) Hi, https://github.com/dpla/ingest-wikimedia/blob/main/ingest_wikimedia/web.py#L36-L38 say ` HTTP_REQUEST_HEADERS = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0... 
[17:49:50] (03PS3) 10Cathal Mooney: Add INCLUDE files for new IPv6 addresses in use in codfw [dns] - 10https://gerrit.wikimedia.org/r/1144625 (https://phabricator.wikimedia.org/T382219) [17:50:07] PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:50:32] RESOLVED: ErrorBudgetBurn: search-update-lag codfw - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:50:45] (03CR) 10Ssingh: [C:03+1] Add INCLUDE files for new IPv6 addresses in use in codfw [dns] - 10https://gerrit.wikimedia.org/r/1144625 (https://phabricator.wikimedia.org/T382219) (owner: 10Cathal Mooney) [17:50:51] (03CR) 10Vgutierrez: "looks good, see online comment.. also you could unset the header unconditionally on varnish" [puppet] - 10https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) (owner: 10Fabfur) [17:51:20] (03CR) 10Cathal Mooney: [C:03+2] Add INCLUDE files for new IPv6 addresses in use in codfw [dns] - 10https://gerrit.wikimedia.org/r/1144625 (https://phabricator.wikimedia.org/T382219) (owner: 10Cathal Mooney) [17:51:33] !log cmooney@dns2005 START - running authdns-update [17:53:07] (03PS6) 10Scott French: P:mw::maint::purge_expired_userrights: purge_expired_userrights to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143198 (https://phabricator.wikimedia.org/T385866) [17:53:11] (03PS6) 10Scott French: P:mw::maint::purge_expired_userrights: purge_expired_global_rights to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143199 (https://phabricator.wikimedia.org/T385866) [17:55:41] PROBLEM - Hadoop NodeManager on an-worker1192 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:55:50] (03CR) 10Eevans: "Not sure how much I can be reviewing this one. 😳" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144457 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [17:56:17] PROBLEM - Hadoop NodeManager on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:57:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[23] - https://phabricator.wikimedia.org/T393948 (10RobH) 03NEW [17:58:02] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:59:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[23] - https://phabricator.wikimedia.org/T393948#10813337 (10RobH) a:03klausman Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) an... 
[17:59:18] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1070.eqiad.wmnet with OS bullseye [17:59:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[23] - https://phabricator.wikimedia.org/T393948#10813341 (10RobH) [17:59:54] !log cmooney@dns2005 START - running authdns-update [18:00:35] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:01:40] !log cmooney@dns2005 END - running authdns-update [18:02:07] RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:02:24] (03PS2) 10Alexandros Kosiaris: partman: Add a kubernetes-node-containerd-efi recipe [puppet] - 10https://gerrit.wikimedia.org/r/1143817 (https://phabricator.wikimedia.org/T393053) [18:02:24] (03PS1) 10Alexandros Kosiaris: preseed: Use EFI recipes for aux-k8s-worker[12]00[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/1144627 (https://phabricator.wikimedia.org/T393053) [18:08:58] 10ops-codfw, 06SRE, 06DC-Ops: Install test Mellanox nic into sretest2001 - https://phabricator.wikimedia.org/T386951#10813372 (10RobH) 05Open→03Declined never got this test nic went a different route (known good broadcom) [18:13:43] RECOVERY - Hadoop NodeManager on an-worker1197 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:18:11] PROBLEM - ElasticSearch setting check - 9600 on elastic1095 is CRITICAL: CRITICAL - [elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700] does not match [cirrussearch1073.eqiad.wmnet:9700, 
elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700] for .( [18:18:11] https://wikitech.wikimedia.org/wiki/Search%23Administration [18:19:41] RECOVERY - Hadoop NodeManager on an-worker1192 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:20:17] RECOVERY - Hadoop NodeManager on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:22:27] PROBLEM - ElasticSearch setting check - 9600 on elastic1075 is CRITICAL: CRITICAL - [elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700] does not match [cirrussearch1073.eqiad.wmnet:9700, elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700] for .( [18:22:27] https://wikitech.wikimedia.org/wiki/Search%23Administration [18:22:42] 10ops-codfw, 06SRE, 06DC-Ops: Install test Mellanox nic into sretest2001 - https://phabricator.wikimedia.org/T386951#10813401 (10RobH) [18:35:04] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_ulsfo [18:35:32] ^^ checking out the Elastic errors [18:37:02] 06SRE: Long-running throttling/timeouts during batch uploads of images to Commons - https://phabricator.wikimedia.org/T393938#10813436 (10MichaelDellaBitta) Hi Aklapper, thank you for your response! My understanding is that that only affects the part of the code that downloads the images from the host instituti... 
[18:37:21] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_ulsfo [18:38:37] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10813437 (10Eevans) [18:39:20] ACKNOWLEDGEMENT - ElasticSearch setting check - 9600 on elastic1075 is CRITICAL: CRITICAL - [elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700] does not match [cirrussearch1073.eqiad.wmnet:9700, elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700 [18:39:20] cluster Brian_King Running Puppet to clear these up, should be fixed soon https://wikitech.wikimedia.org/wiki/Search%23Administration [18:39:20] ACKNOWLEDGEMENT - ElasticSearch setting check - 9600 on elastic1095 is CRITICAL: CRITICAL - [elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700] does not match [cirrussearch1073.eqiad.wmnet:9700, elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700 [18:39:20] cluster Brian_King Running Puppet to clear these up, should be fixed soon https://wikitech.wikimedia.org/wiki/Search%23Administration [18:39:59] ACKNOWLEDGEMENT - ElasticSearch setting check - 9600 on elastic1075 is CRITICAL: CRITICAL - [elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700] does not match [cirrussearch1073.eqiad.wmnet:9700, elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, 
elastic1102.eqiad.wmnet:9700 [18:39:59] cluster Brian_King Running Puppet and these should clear up soon https://wikitech.wikimedia.org/wiki/Search%23Administration [18:39:59] ACKNOWLEDGEMENT - ElasticSearch setting check - 9600 on elastic1095 is CRITICAL: CRITICAL - [elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700] does not match [cirrussearch1073.eqiad.wmnet:9700, elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700 [18:39:59] cluster Brian_King Running Puppet and these should clear up soon https://wikitech.wikimedia.org/wiki/Search%23Administration [18:42:48] (03CR) 10Dwisehaupt: [C:03+2] community-crm: Add mx records [dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [18:42:55] (03PS4) 10Dwisehaupt: community-crm: Add mx records [dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) [18:43:51] PROBLEM - ElasticSearch setting check - 9600 on elastic1073 is CRITICAL: CRITICAL - [elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700] does not match [cirrussearch1073.eqiad.wmnet:9700, elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700] for .( [18:43:51] https://wikitech.wikimedia.org/wiki/Search%23Administration [18:43:53] PROBLEM - ElasticSearch setting check - 9600 on elastic1083 is CRITICAL: CRITICAL - [elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700] does not match [cirrussearch1073.eqiad.wmnet:9700, elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, 
elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700] for .( [18:43:53] https://wikitech.wikimedia.org/wiki/Search%23Administration [18:43:55] PROBLEM - ElasticSearch setting check - 9600 on elastic1102 is CRITICAL: CRITICAL - [elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700] does not match [cirrussearch1073.eqiad.wmnet:9700, elastic1073.eqiad.wmnet:9700, elastic1075.eqiad.wmnet:9700, elastic1083.eqiad.wmnet:9700, elastic1095.eqiad.wmnet:9700, elastic1102.eqiad.wmnet:9700] for .( [18:43:55] https://wikitech.wikimedia.org/wiki/Search%23Administration [18:44:22] (03CR) 10Dwisehaupt: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1144610 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [18:44:43] (03CR) 10BCornwall: [C:03+2] varnish: Replace X-IS-ALT-DOMAIN with variable [puppet] - 10https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [18:45:25] !log dwisehaupt@dns1004 START - running authdns-update [18:46:38] !log dwisehaupt@dns1004 END - running authdns-update [18:49:17] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:49:45] 10SRE-swift-storage, 06Commons, 10media-backups: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049#10813465 (10PMG) @jcrespo thank you very much. Everything works correct. Have a great day/... 
[18:52:05] PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:53:14] (03PS1) 10Jdlrobson: Update to echarts 5.6.0 [extensions/Chart] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144638 (https://phabricator.wikimedia.org/T393377) [18:58:42] (03PS1) 10Bking: elastic: don't filter out self cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1144639 (https://phabricator.wikimedia.org/T393100) [18:59:55] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1144639 (https://phabricator.wikimedia.org/T393100) (owner: 10Bking) [19:00:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/Chart] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144638 (https://phabricator.wikimedia.org/T393377) (owner: 10Jdlrobson) [19:01:05] RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:06:34] 06SRE, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#10813580 (10Ijon) Can anyone provide an update on this? 
[19:12:11] (03PS2) 10Bking: elastic: don't filter out self cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1144639 (https://phabricator.wikimedia.org/T393100) [19:12:16] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1144639 (https://phabricator.wikimedia.org/T393100) (owner: 10Bking) [19:14:17] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:15:21] (03CR) 10Ebernhardson: [C:03+1] cirrussearch: Add cluster-specific domain name as a SAN [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [19:16:21] (03CR) 10Bking: [C:03+2] cirrussearch: Add cluster-specific domain name as a SAN [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [19:20:21] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10813646 (10BCornwall) [19:22:24] 06SRE, 06Traffic: Long-running throttling/timeouts during batch uploads of images to Commons - https://phabricator.wikimedia.org/T393938#10813651 (10Aklapper) [19:22:49] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10813654 (10BCornwall) a:03BCornwall [19:23:52] (03PS3) 10Bking: search: cname specific search clusters to the lvs pool [dns] - 10https://gerrit.wikimedia.org/r/1143891 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:25:42] 
(03PS10) 10Ebernhardson: search: add discovery records for secondary clusters [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) [19:29:05] PROBLEM - Hadoop NodeManager on an-worker1174 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:30:08] (03PS3) 10Fabfur: cache: lua lookup experiment [puppet] - 10https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) [19:31:06] (03PS4) 10Bking: search: cname specific search clusters to the lvs pool [dns] - 10https://gerrit.wikimedia.org/r/1143891 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:31:15] (03PS11) 10Ebernhardson: search: add discovery records for secondary clusters [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) [19:32:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1070-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [19:33:09] (03CR) 10Bking: [C:03+2] search: cname specific search clusters to the lvs pool [dns] - 10https://gerrit.wikimedia.org/r/1143891 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:33:17] (03CR) 10Bking: [C:03+2] search: add discovery records for secondary clusters [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:33:57] (03PS4) 10Fabfur: cache: lua lookup experiment [puppet] - 10https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) 
[19:34:00] !log bking@dns1004 START - running authdns-update [19:35:31] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) (owner: 10Fabfur) [19:35:52] (03CR) 10Fabfur: "that was the first idea but I thought about people reading the vcl during the experiments and asking about this header..." [puppet] - 10https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) (owner: 10Fabfur) [19:40:17] !log bking@dns1004 START - running authdns-update [19:40:24] (03PS5) 10Fabfur: cache: lua lookup experiment [puppet] - 10https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) [19:42:09] (03CR) 10Bking: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:44:29] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 09b943f9c0fb4cda652306c1c049ef886e250ad5, dns.git is f2aed99724e2b0e8fa0987851c7eac732ed79628) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:45:13] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 09b943f9c0fb4cda652306c1c049ef886e250ad5, dns.git is f2aed99724e2b0e8fa0987851c7eac732ed79628) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:45:13] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 09b943f9c0fb4cda652306c1c049ef886e250ad5, dns.git is f2aed99724e2b0e8fa0987851c7eac732ed79628) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:45:13] PROBLEM - check if authdns-update was run after a change was merged 
to operations/dns.git on dns7002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 09b943f9c0fb4cda652306c1c049ef886e250ad5, dns.git is f2aed99724e2b0e8fa0987851c7eac732ed79628) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:45:19] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 09b943f9c0fb4cda652306c1c049ef886e250ad5, dns.git is f2aed99724e2b0e8fa0987851c7eac732ed79628) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:45:51] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 09b943f9c0fb4cda652306c1c049ef886e250ad5, dns.git is f2aed99724e2b0e8fa0987851c7eac732ed79628) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:45:51] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 09b943f9c0fb4cda652306c1c049ef886e250ad5, dns.git is f2aed99724e2b0e8fa0987851c7eac732ed79628) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:45:51] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 09b943f9c0fb4cda652306c1c049ef886e250ad5, dns.git is f2aed99724e2b0e8fa0987851c7eac732ed79628) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:45:51] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 09b943f9c0fb4cda652306c1c049ef886e250ad5, dns.git is f2aed99724e2b0e8fa0987851c7eac732ed79628) 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:45:51] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 09b943f9c0fb4cda652306c1c049ef886e250ad5, dns.git is f2aed99724e2b0e8fa0987851c7eac732ed79628) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:45:53] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 09b943f9c0fb4cda652306c1c049ef886e250ad5, dns.git is f2aed99724e2b0e8fa0987851c7eac732ed79628) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:45:53] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 09b943f9c0fb4cda652306c1c049ef886e250ad5, dns.git is f2aed99724e2b0e8fa0987851c7eac732ed79628) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:46:35] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 09b943f9c0fb4cda652306c1c049ef886e250ad5, dns.git is f2aed99724e2b0e8fa0987851c7eac732ed79628) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:47:19] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 09b943f9c0fb4cda652306c1c049ef886e250ad5, dns.git is f2aed99724e2b0e8fa0987851c7eac732ed79628) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:48:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Local zone files are NOT in sync with 
operations/dns.git (SHA: local is 09b943f9c0fb4cda652306c1c049ef886e250ad5, dns.git is f2aed99724e2b0e8fa0987851c7eac732ed79628) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:48:11] PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:49:26] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [19:49:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T392806)', diff saved to https://phabricator.wikimedia.org/P75928 and previous config saved to /var/cache/conftool/dbconfig/20250512-194933-fceratto.json [19:49:56] ebernhardson: please run authdns-update to roll out the changes [19:49:58] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache search-chi.svc.eqiad.wmnet on all recursors [19:50:01] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) search-chi.svc.eqiad.wmnet on all recursors [19:50:09] since it was submitted but not deployed [19:50:20] and hence the failing Icinga check [19:50:23] inflatador: ^ [19:52:07] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) (owner: 10Fabfur) [19:52:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move connections on ssw1-f1-codfw to match normal pattern - https://phabricator.wikimedia.org/T393936#10813758 (10cmooney) [19:54:08] (03PS1) 10Ayounsi: Remove Turnilo dependency on netops:monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1144644 (https://phabricator.wikimedia.org/T388641) [19:54:35] (03CR) 10Gergő Tisza: [C:04-2] "Per the task, needs further work, or at least further testing." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (https://phabricator.wikimedia.org/T362324) (owner: 10Gergő Tisza) [19:55:05] RECOVERY - Hadoop NodeManager on an-worker1174 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:57:30] sukhe I did run authdns-update , looks like it failed [19:57:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T392806)', diff saved to https://phabricator.wikimedia.org/P75929 and previous config saved to /var/cache/conftool/dbconfig/20250512-195732-fceratto.json [19:57:49] moving conversation to #traffic where it's less busy [19:58:15] !log bking@dns1004 START - running authdns-update [20:00:01] PROBLEM - Hadoop NodeManager on an-worker1196 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and thcipriani: #bothumor I � Unicode. All rise for UTC late backport window / Backport Party!Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250512T2000). [20:00:05] bwang, dr0ptp4kt, tgr, and Krinkle: A patch you scheduled for UTC late backport window / Backport Party!Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:14] o/ [20:00:15] o/ [20:00:19] o/ [20:00:21] ohai [20:00:46] ? 
[20:00:55] ah patch, nevermind :) [20:01:11] I can deploy [20:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:02:11] RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:02:11] looks like all the config patches can go together? [20:02:16] except maybe the dblist one [20:02:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:43] bwang: around for backports? [20:03:53] sounds good to me (RE tgr_ ). 
[20:06:17] (03CR) 10Vgutierrez: cache: lua lookup experiment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) (owner: 10Fabfur) [20:07:36] (03PS2) 10Ayounsi: Remove Turnilo dependency on netops:monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1144644 (https://phabricator.wikimedia.org/T388641) [20:08:08] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1144644 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [20:10:13] !log bearloga@deploy1003 Started deploy [airflow-dags/analytics_product@17f8417]: (no justification provided) [20:10:38] alrighty, getting started with dr0ptp4kt 's config change [20:10:40] (03PS1) 10Bking: Revert "search: add discovery records for secondary clusters" [dns] - 10https://gerrit.wikimedia.org/r/1144645 [20:10:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dr0ptp4kt@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143772 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt) [20:10:46] (03CR) 10Bking: [V:03+2 C:03+2] Revert "search: add discovery records for secondary clusters" [dns] - 10https://gerrit.wikimedia.org/r/1144645 (owner: 10Bking) [20:11:03] !log bearloga@deploy1003 Finished deploy [airflow-dags/analytics_product@17f8417]: (no justification provided) (duration: 00m 53s) [20:11:14] (03PS1) 10Bking: Revert "search: cname specific search clusters to the lvs pool" [dns] - 10https://gerrit.wikimedia.org/r/1144646 [20:11:18] (03PS4) 10Ebernhardson: search: Update dnsdisc envoy upstreams [puppet] - 10https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553) [20:11:18] (03PS1) 10Ebernhardson: Add search-chi-https service [puppet] - 10https://gerrit.wikimedia.org/r/1144647 [20:11:28] (03Merged) 10jenkins-bot: Stream config for edge uniques on prod cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143772 
(https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt) [20:11:28] (03PS2) 10Bking: Revert "search: cname specific search clusters to the lvs pool" [dns] - 10https://gerrit.wikimedia.org/r/1144646 [20:11:44] !log dr0ptp4kt@deploy1003 Started scap sync-world: Backport for [[gerrit:1143772|Stream config for edge uniques on prod cluster (T391959)]] [20:11:47] T391959: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: EventGate & Varnish: update to receive events from beacon event v2 - https://phabricator.wikimedia.org/T391959 [20:11:54] (03CR) 10Thcipriani: [C:03+2] "Getting changes merging for backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144488 (https://phabricator.wikimedia.org/T393621) (owner: 10Gergő Tisza) [20:12:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P75930 and previous config saved to /var/cache/conftool/dbconfig/20250512-201240-fceratto.json [20:12:42] (03CR) 10Bking: [C:03+2] Revert "search: cname specific search clusters to the lvs pool" [dns] - 10https://gerrit.wikimedia.org/r/1144646 (owner: 10Bking) [20:12:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10813835 (10BCornwall) @VRiley-WMF Yes, A7 as detailed in T387145#10720903. It's idling at the moment and can be serviced. Thanks! 
[20:13:17] !log sukhe@dns1004 START - running authdns-update [20:14:20] (03PS1) 10Dwisehaupt: postfix: add community-crm as a valid relay domain [puppet] - 10https://gerrit.wikimedia.org/r/1144648 (https://phabricator.wikimedia.org/T383715) [20:14:29] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:14:34] !log sukhe@dns1004 END - running authdns-update [20:15:13] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:15:13] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:15:13] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:15:18] (03PS2) 10Scott French: P:mw::maintenance::refreshlinks: migrate remaining shards to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1144637 (https://phabricator.wikimedia.org/T388530) [20:15:19] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10813838 (10BCornwall) Here's a screenshot as well that may help Dell: {F59918797} [20:15:19] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:15:21] ^ inflatador all good [20:15:30] (03PS2) 10Dwisehaupt: postfix: add community-crm as a valid relay domain [puppet] - 
10https://gerrit.wikimedia.org/r/1144648 (https://phabricator.wikimedia.org/T383715) [20:15:51] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:15:51] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:15:51] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:15:51] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:15:51] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:15:53] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:15:53] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:16:28] !log dr0ptp4kt@deploy1003 dr0ptp4kt: Backport for [[gerrit:1143772|Stream config for edge uniques on prod cluster (T391959)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:16:35] RECOVERY - check if authdns-update was run after a change was 
merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:16:47] (03PS3) 10Ayounsi: Remove Turnilo dependency on netops:monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1144644 (https://phabricator.wikimedia.org/T388641) [20:17:02] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1144644 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [20:17:19] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:18:05] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:22:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [20:22:25] (03PS4) 10Ayounsi: Remove Turnilo dependency on netops:monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1144644 (https://phabricator.wikimedia.org/T388641) [20:22:44] (03Merged) 10jenkins-bot: Do not do unnecessary fallback during autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144488 (https://phabricator.wikimedia.org/T393621) (owner: 10Gergő Tisza) [20:23:57] !log dr0ptp4kt@deploy1003 dr0ptp4kt: Continuing with sync [20:24:10] Krinkle: did you want to deploy yours together or do they need to go out separately? [20:24:39] if mine are not combined with the others, then doing mine together is fine. 
[20:25:01] RECOVERY - Hadoop NodeManager on an-worker1196 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:25:06] otherwise, I'd combine only the 'mc' change with the others. [20:25:12] and then my second one later. [20:27:01] ack [20:27:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P75931 and previous config saved to /var/cache/conftool/dbconfig/20250512-202746-fceratto.json [20:30:13] (03PS1) 10BCornwall: admin: Add bwojtowicz to ML-related accesses [puppet] - 10https://gerrit.wikimedia.org/r/1144649 (https://phabricator.wikimedia.org/T393595) [20:30:38] !log dr0ptp4kt@deploy1003 Finished scap sync-world: Backport for [[gerrit:1143772|Stream config for edge uniques on prod cluster (T391959)]] (duration: 18m 53s) [20:30:41] T391959: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: EventGate & Varnish: update to receive events from beacon event v2 - https://phabricator.wikimedia.org/T391959 [20:30:48] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts lvs3009.esams.wmnet [20:31:32] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts lvs3009.esams.wmnet [20:33:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135529 (https://phabricator.wikimedia.org/T371378) (owner: 10Krinkle) [20:34:27] (03Merged) 10jenkins-bot: mc: remove unused "memcached-pecl" definition from wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135529 (https://phabricator.wikimedia.org/T371378) (owner: 10Krinkle) [20:34:28] Krinkle: ^ tgr_ 's already merged so doing that one alongside [20:34:58] ok [20:35:12] This is meant to be a no-op, 
but I'll do some basic testing once it's on mwdebug. [20:35:58] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1135529|mc: remove unused "memcached-pecl" definition from wgObjectCaches (T371378)]] [20:36:01] T371378: Cleanup: Wikitech code leftovers - https://phabricator.wikimedia.org/T371378 [20:36:48] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10813893 (10RobH) They finally answered back first asking simple questions like if the network port or cable are bad (they aren't) and then after another 48 hours requesting firmwar... [20:37:43] (03PS1) 10Ayounsi: Icinga: remove some network devices checks [puppet] - 10https://gerrit.wikimedia.org/r/1144650 (https://phabricator.wikimedia.org/T388641) [20:37:53] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1144650 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [20:40:40] !log tgr@deploy1003 tgr, krinkle: Backport for [[gerrit:1135529|mc: remove unused "memcached-pecl" definition from wgObjectCaches (T371378)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:42:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T392806)', diff saved to https://phabricator.wikimedia.org/P75932 and previous config saved to /var/cache/conftool/dbconfig/20250512-204253-fceratto.json [20:43:13] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [20:43:29] !log fceratto@cumin1002 DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [20:43:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T392806)', diff saved to https://phabricator.wikimedia.org/P75933 and previous config saved to /var/cache/conftool/dbconfig/20250512-204336-fceratto.json [20:43:45] * Krinkle is testing [20:45:46] 06SRE, 10Observability-Metrics: Set a predefined time window in Pyrra's configuration to measure SLOs with - https://phabricator.wikimedia.org/T393796#10813932 (10herron) > 2. In the Pyrra Grafana dashboards that are exported. Ideally we'd want to avoid setting the time manually and/or use a specialized URI ev... [20:46:22] LGTM! [20:46:35] cool, thanks :) [20:46:38] !log tgr@deploy1003 tgr, krinkle: Continuing with sync [20:46:56] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:48:48] (03PS4) 10Scott French: P:mediawiki::php: add uuid extension for PHP 8.1+ [puppet] - 10https://gerrit.wikimedia.org/r/1139947 (https://phabricator.wikimedia.org/T373752) [20:51:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T392806)', diff saved to https://phabricator.wikimedia.org/P75934 and previous config saved to /var/cache/conftool/dbconfig/20250512-205143-fceratto.json [20:53:25] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135529|mc: remove unused "memcached-pecl" definition from wgObjectCaches (T371378)]] (duration: 17m 27s) [20:53:28] T371378: Cleanup: Wikitech code leftovers - https://phabricator.wikimedia.org/T371378 [20:58:14] thcipriani: I knew syncs take 20-25min when a mw core/ext patch is involved (given CI alone will take 15min). 
I'm surprised config patches take that long now as well. I noticed this earlier today with my own deploy as well. [20:58:50] I thought wmf-config was in the last docker image layer, so that should be a relatively quick build and sync in theory. At some point, it was, I think? Has that changed? [21:00:04] Reedy, sbassett, Maryum, and manfredi: gettimeofday() says it's time for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250512T2100) [21:00:31] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10813970 (10Jdlrobson-WMF) > @Jdlrobson-WMF this seems like an odd question after all this time, but have you signed L3 Acknowledgement of Wikimedia Server Access Responsibilities ? I don't... [21:00:54] Krinkle: we've changed a few things about image building, but it should still be that one layer that is getting pushed to the registry; however, we're building more images right now during the php8.1 transition [21:01:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141522 (owner: 10Krinkle) [21:01:31] (03CR) 10CI reject: [V:04-1] multiversion: Move remaining dblist helper to WmfConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141522 (owner: 10Krinkle) [21:01:50] (03PS4) 10Krinkle: multiversion: Move remaining dblist helper to WmfConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141522 [21:02:49] (03CR) 10TrainBranchBot: "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141522 (owner: 10Krinkle) [21:03:02] Hey all - I’d like to get two sec deploys out now, if we can. 
[21:03:35] (03Merged) 10jenkins-bot: multiversion: Move remaining dblist helper to WmfConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141522 (owner: 10Krinkle) [21:03:38] sbassett: still some backport window happening but i imagine that'll be fine shortly [21:03:41] previous deploy window is not yet finished, probably 10min? [21:03:49] (03PS1) 10JHathaway: community-civicrm: listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1144657 (https://phabricator.wikimedia.org/T383715) [21:03:51] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1141522|multiversion: Move remaining dblist helper to WmfConfig class]] [21:04:10] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1144657 (https://phabricator.wikimedia.org/T383715) (owner: 10JHathaway) [21:04:11] Krinkle: also the previous deploy window actually included tgr's backport as well [21:04:41] the one happening now should be more representative of a pure config change timing [21:04:51] ack [21:05:00] sbassett: sorry, didn't notice sooner [21:05:08] do you want me to stop scap? [21:05:28] we aren't doing anything important [21:06:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P75935 and previous config saved to /var/cache/conftool/dbconfig/20250512-210650-fceratto.json [21:06:51] I wonder how much is still spent in l10n building and cdb converting. I suppose the "json-to-cdb (MW) > cdb-to-json (Scap) > json-to-cdb + md5 checks (Scap)" have now been reduced to just "json-to-cdb (MW)". Or maybe not yet since there's still a handful of rsync destinations [21:07:23] s/rsync destinations/php74 destinations" [21:07:45] tgr_: nope, I can wait a bit. 
[21:08:33] !log tgr@deploy1003 tgr, krinkle: Backport for [[gerrit:1141522|multiversion: Move remaining dblist helper to WmfConfig class]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:08:58] Krinkle: not really testable, right? [21:09:22] I'm testing a few pageviews and uncached API requests and noc views right now to make sure nothing fatals [21:10:32] tgr_: LGTM [21:10:37] !log tgr@deploy1003 tgr, krinkle: Continuing with sync [21:10:42] There are still 11 legacy bare metal machines, so we haven't been able to peel out the bare metal bits for cdb files. There are step timings on: https://grafana.wikimedia.org/d/000000086/scap?orgId=1&from=now-90d&to=now&timezone=utc&var-component=scap&refresh=1m [21:11:03] Krinkle: l10n handling is an insignificant time factor _unless_ a large l10n rebuild occurs, in which case the effects of it are a significant time factor. [21:16:37] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp2029.codfw.wmnet [21:17:17] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141522|multiversion: Move remaining dblist helper to WmfConfig class]] (duration: 13m 25s) [21:20:18] sbassett: You're up! [21:20:21] (03CR) 10JHathaway: [C:03+2] community-civicrm: listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1144657 (https://phabricator.wikimedia.org/T383715) (owner: 10JHathaway) [21:20:30] dancy: thanks! 
[21:21:17] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1124 is CRITICAL: CRITICAL - cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z), enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:21:19] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch1124 is CRITICAL: CRITICAL - skwiktionary_archive_1717364159[0](2025-05-09T20:16:16.114Z), azwikiquote_content_1727944408[0](2025-05-09T20:16:25.362Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:21:36] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts lvs3009.esams.wmnet [21:21:52] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts lvs3009.esams.wmnet [21:21:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P75936 and previous config saved to /var/cache/conftool/dbconfig/20250512-212157-fceratto.json [21:22:46] 10ops-codfw, 06DC-Ops, 06Traffic: hw troubleshooting: Memory failure for cp2029.codfw.wmnet - https://phabricator.wikimedia.org/T393968 (10BCornwall) 03NEW [21:22:56] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts lvs3009.esams.wmnet [21:24:39] (03CR) 10Eevans: [C:03+2] JBOD partman recipe for Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/1142635 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [21:27:09] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Fri 06 Jun 2025 10:25:00 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [21:28:44] !log ryankemper@cumin2002 conftool action : set/pooled=no; selector: name=cirrussearch2091.codfw.wmnet|cirrussearch2055.codfw.wmnet|cirrussearch2113.codfw.wmnet|cirrussearch1118.eqiad.wmnet|elastic1080.eqiad.wmnet|elastic1057.eqiad.wmnet|elastic1059.eqiad.wmnet|elastic1083.eqiad.wmnet|elastic1076.eqiad.wmnet [21:30:57] (03PS1) 10Eevans: cassandra-dev: preseed cassandra-dev2001 for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1144661 (https://phabricator.wikimedia.org/T391544) [21:31:15] !log Testing rsyslog_8.2504.0-1~bpo12+1 on centrallog1002 - T383309 [21:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:18] T383309: rsyslog receiver on centrallog hosts misplaces some log host entries - https://phabricator.wikimedia.org/T383309 [21:31:36] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_codfw [21:31:40] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_codfw [21:31:53] !log Removed mitigation for T390887 and T393367 [21:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:34] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts lvs3009.esams.wmnet [21:35:30] (03CR) 10Eevans: [C:03+2] cassandra-dev: preseed cassandra-dev2001 for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1144661 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [21:37:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T392806)', diff saved to https://phabricator.wikimedia.org/P75937 and previous config saved to /var/cache/conftool/dbconfig/20250512-213704-fceratto.json [21:37:05] (03CR) 10JHathaway: [C:03+2] postfix: add community-crm as a valid relay domain [puppet] - 10https://gerrit.wikimedia.org/r/1144648 
(https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [21:37:25] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [21:37:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T392806)', diff saved to https://phabricator.wikimedia.org/P75938 and previous config saved to /var/cache/conftool/dbconfig/20250512-213731-fceratto.json [21:39:19] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7f48a9624f50: Failed to establish a new connection: [Errno 113 [21:39:19] te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:40:19] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 34, number_of_data_nodes: 34, active_primary_shards: 1708, active_shards: 5112, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 11, delayed_unassigned_shards: 0, number_of_pending_ta [21:40:19] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.78528206129221 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:40:50] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2029.codfw.wmnet with reason: Potential failed memory - T393968 [21:40:53] T393968: hw troubleshooting: Memory failure for cp2029.codfw.wmnet - https://phabricator.wikimedia.org/T393968 [21:41:07] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 7 days, 0:00:00 on cp2029.codfw.wmnet with reason: Potential failed memory - T393968 [21:42:23] !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host cassandra-dev2001.codfw.wmnet with OS bullseye [21:42:33] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10814102 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host cassandra-dev2001.... [21:42:34] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts lvs3009.esams.wmnet [21:42:35] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10814103 (10RobH) idrac updated, applying bios now [21:43:46] 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10814107 (10BCornwall) Hi, @cmassaro, are you able to provide the confirmation of your new key in the manner described above? Thanks! [21:43:59] 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10814110 (10BCornwall) 05In progress→03Stalled p:05Triage→03Medium [21:44:59] 06SRE, 10SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10814128 (10BCornwall) 05In progress→03Stalled Hi, @Seddon, are you able to provide the confirmation of your new key in the manner described above? Thanks! 
[21:45:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T392806)', diff saved to https://phabricator.wikimedia.org/P75939 and previous config saved to /var/cache/conftool/dbconfig/20250512-214542-fceratto.json [21:46:36] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10814137 (10BCornwall) 05In progress→03Stalled a:03Jdlrobson-WMF Hi, @Jdlrobson-WMF, could you please sign the L3 doc? Thanks! [21:46:47] robh@cumin2002 upgrade-firmware (PID 1994852) is awaiting input [21:47:05] 06SRE, 10SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10814144 (10BCornwall) a:03Seddon [21:47:09] 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10814145 (10BCornwall) a:03cmassaro [21:47:21] PROBLEM - Host lvs3009 is DOWN: PING CRITICAL - Packet loss = 100% [21:47:41] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:48:42] !log Deployed security fixes 03, 04 and 05 for T392341 [21:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:20] Ok, that should wrap up the security deploys for now. 
[21:52:36] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts lvs3009.esams.wmnet [21:52:57] RECOVERY - Host lvs3009 is UP: PING OK - Packet loss = 0%, RTA = 80.26 ms [21:53:41] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:55:19] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7f9fb2420fd0: Failed to establish a new connection: [Errno 113 [21:55:19] te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:56:19] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 34, number_of_data_nodes: 34, discovered_master: True, active_primary_shards: 1708, active_shards: 5112, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 11, delayed_unassigned_shards [21:56:19] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.78528206129221 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:58:33] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage [21:58:48] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts lvs3009.esams.wmnet [21:59:21] !log robh@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs3009.esams.wmnet [22:00:50] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P75940 and previous config saved to /var/cache/conftool/dbconfig/20250512-220049-fceratto.json [22:01:41] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:02:38] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:02:48] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage [22:02:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:03:05] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10814233 (10RobH) bios updated, applying nic firmware update now [22:04:08] 👋 [22:04:12] !incidents [22:04:13] 6117 (UNACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [22:04:13] 6115 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [22:04:13] 6114 (RESOLVED) [2x] ProbeDown sre (upload-https:443 probes/service eqsin) [22:04:17] !ack 6117 [22:04:17] 6117 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [22:04:18] o/ here as well [22:04:41] looking at https://wikitech.wikimedia.org/wiki/Thanos#Alerts and that linked dashboard sure is not dashboarding 
[22:04:42] `PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled` [22:05:22] anyone from o11y lurking? otherwise we'll do our best :) [22:05:42] FIRING: JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:05:42] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:05:50] I don't think anyone's in-timezone except potentially cwhite [22:05:55] it says it might self recover, if systemd kills the processes [22:06:04] ye [22:06:05] rzl: here, let me read the backlog. [22:06:21] o/ [22:06:30] hi both :) thanks, not sure if we need you yet but rather have you around if you're nearby [22:06:49] hey not at a computer atm but can get to one if needed [22:07:07] I recently upgraded rsyslog on centrallog1002 but I don't think that's the cause of the issue. 
[22:07:36] !log restart thanos-query on titan1001 [22:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:38] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:08:59] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3009.esams.wmnet [22:09:00] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts lvs3009.esams.wmnet [22:09:30] looking better -- not sure if that was a QOD or what [22:10:20] looks more like CPU pressure than memory pressure [22:10:24] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7fb7a4c24f50: Failed to establish a new connection: [Errno 113 [22:10:24] te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:10:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:10:44] (and dropping gradually but I assume that's because of time-averaging) [22:11:17] actually titan1002 cpu is still maxed, not sure what to make of that [22:11:24] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 34, number_of_data_nodes: 34, discovered_master: True, active_primary_shards: 
1708, active_shards: 5112, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 11, delayed_unassigned_shards [22:11:24] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.78528206129221 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:12:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:13:53] cwhite: any thoughts on bouncing the service at titan1002 too? [22:14:53] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10814251 (10RobH) NIC updated. @ssingh: I'll let this sit idle for a day or so and we can see if it errors, if not can we then return to service and check for errors this week whil... [22:15:09] rzl: yeah, probably should be done there too. [22:15:25] will do, unless you're about to [22:15:39] feel free :) [22:15:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P75941 and previous config saved to /var/cache/conftool/dbconfig/20250512-221556-fceratto.json [22:16:13] !log rzl@titan1002:~$ sudo systemctl restart thanos-query [22:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:48] probably unrelated but the elasticsearch issue is due to a network config problem on lvs1019/switch. Vlan missing for newly provisioned rack, I'm adding now. [22:16:58] rack e8/f8 in eqiad affected by that [22:16:59] can I deploy something before pc purge script kicks in? [22:17:16] Amir1: give it 2m just to make sure thanos comes back okay [22:17:21] sure! 
[22:17:27] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10814278 (10Jdlrobson-WMF) [22:17:31] (that thing makes a lot of deprecation warning logs) [22:18:07] cwhite: I'm happy if you are, I don't see a smoking-gun trigger yet though [22:18:15] "smoking-gun trigger" might be a mixed metaphor [22:18:44] denisse too, don't know if you've found anything :) [22:20:24] (03PS1) 10Cwhite: prometheus: add more recording rules around editResponseTime [puppet] - 10https://gerrit.wikimedia.org/r/1144662 (https://phabricator.wikimedia.org/T391677) [22:20:40] Amir1: floor's all yours as far as I'm concerned [22:20:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:21:01] Things LGTM!! Thanks everyone!! [22:21:28] Thank you! [22:21:35] (03CR) 10Ladsgroup: [C:03+2] objectcache: Cast explicitly to integer [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144502 (https://phabricator.wikimedia.org/T393879) (owner: 10Ladsgroup) [22:22:26] rzl: I strongly suspect a QOD. I happened to be asking thanos about editResponseTime around the same time. We know that metric is problematic (T391677). 
[22:22:27] T391677: Audit dashboards using histogram_quantile on mediawiki_WikimediaEvents_editResponseTime - https://phabricator.wikimedia.org/T391677 [22:22:37] aha [22:23:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144502 (https://phabricator.wikimedia.org/T393879) (owner: 10Ladsgroup) [22:25:17] (03CR) 10Cwhite: prometheus: add more recording rules around editResponseTime (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1144662 (https://phabricator.wikimedia.org/T391677) (owner: 10Cwhite) [22:31:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T392806)', diff saved to https://phabricator.wikimedia.org/P75942 and previous config saved to /var/cache/conftool/dbconfig/20250512-223103-fceratto.json [22:31:24] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [22:31:25] (03PS1) 10Cathal Mooney: Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts [puppet] - 10https://gerrit.wikimedia.org/r/1144666 (https://phabricator.wikimedia.org/T382017) [22:31:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T392806)', diff saved to https://phabricator.wikimedia.org/P75943 and previous config saved to /var/cache/conftool/dbconfig/20250512-223131-fceratto.json [22:36:03] (03PS2) 10Cathal Mooney: Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts [puppet] - 10https://gerrit.wikimedia.org/r/1144666 (https://phabricator.wikimedia.org/T393911) [22:39:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T392806)', diff saved to https://phabricator.wikimedia.org/P75944 and previous config saved to /var/cache/conftool/dbconfig/20250512-223915-fceratto.json [22:39:25] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): puppetserver: 
monitor compilation times - https://phabricator.wikimedia.org/T393978 (10jhathaway) 03NEW [22:39:38] (03Merged) 10jenkins-bot: objectcache: Cast explicitly to integer [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1144502 (https://phabricator.wikimedia.org/T393879) (owner: 10Ladsgroup) [22:39:50] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): puppetserver: monitor compilation times - https://phabricator.wikimedia.org/T393978#10814390 (10jhathaway) p:05Triage→03Medium [22:39:53] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1144502|objectcache: Cast explicitly to integer (T393879)]] [22:39:56] T393879: PHP Deprecated: Implicit conversion from float 41.45954849253939 to int loses precision - https://phabricator.wikimedia.org/T393879 [22:41:05] (03CR) 10Cwhite: "statsd_exporter will continue to send metrics to Prometheus. This patch disables sending another copy of those metrics to graphite too." [puppet] - 10https://gerrit.wikimedia.org/r/1144553 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [22:44:33] (03CR) 10Cwhite: "The default policy is DROP - will this cause `nc` to hang?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [22:44:34] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1144502|objectcache: Cast explicitly to integer (T393879)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:44:47] (03CR) 10Jdlrobson: [C:03+1] Stream registration for article summaries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: 10Kimberly Sarabia) [22:44:51] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [22:45:08] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): puppetserver: monitor HTTP request latencies - https://phabricator.wikimedia.org/T393979 (10jhathaway) 03NEW [22:45:14] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): puppetserver: monitor HTTP request latencies - https://phabricator.wikimedia.org/T393979#10814421 (10jhathaway) p:05Triage→03Medium [22:46:07] (03PS1) 10Eevans: cassandra-dev2001: assign data_file_directories for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1144667 (https://phabricator.wikimedia.org/T391544) [22:46:14] (03CR) 10Cwhite: airflow: disable statsd_exporter relaying to graphite (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1144554 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [22:46:31] (03PS1) 10RLazarus: scap: Loud deprecation warning for mwscript, now officially unsupported [puppet] - 10https://gerrit.wikimedia.org/r/1144668 (https://phabricator.wikimedia.org/T341553) [22:47:51] (03CR) 10Eevans: [C:03+2] cassandra-dev2001: assign data_file_directories for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1144667 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [22:49:34] (03CR) 10BCornwall: [C:03+1] Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts [puppet] - 
10https://gerrit.wikimedia.org/r/1144666 (https://phabricator.wikimedia.org/T393911) (owner: 10Cathal Mooney) [22:50:01] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): puppetserver: monitor compilation times - https://phabricator.wikimedia.org/T393978#10814433 (10jhathaway) [22:51:27] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1144502|objectcache: Cast explicitly to integer (T393879)]] (duration: 11m 33s) [22:51:30] T393879: PHP Deprecated: Implicit conversion from float 41.45954849253939 to int loses precision - https://phabricator.wikimedia.org/T393879 [22:52:52] rzl: I'm done, just saying that these spikes shouldn't happen anymore: https://logstash.wikimedia.org/goto/98e82340c0fe7cb5cbf9c6d2c0d3ecdb if you see them, please let me know [22:53:09] (part of migration to mw-cron) [22:53:34] Amir1: sure, I'll keep an eye out 👍 [22:54:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P75945 and previous config saved to /var/cache/conftool/dbconfig/20250512-225422-fceratto.json [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250512T2300) [23:02:04] (03CR) 10Cathal Mooney: [C:03+2] Add Eqiad rack E8 and F8 vlan interfaces on eqiad lvs hosts [puppet] - 10https://gerrit.wikimedia.org/r/1144666 (https://phabricator.wikimedia.org/T393911) (owner: 10Cathal Mooney) [23:02:24] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7f6ed197cfd0: Failed to establish a new connection: [Errno 113 [23:02:24] te to host)) 
https://wikitech.wikimedia.org/wiki/Search%23Administration [23:04:24] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 34, number_of_data_nodes: 34, discovered_master: True, active_primary_shards: 1708, active_shards: 5112, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 11, delayed_unassigned_shards [23:04:24] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.78528206129221 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:09:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P75946 and previous config saved to /var/cache/conftool/dbconfig/20250512-230930-fceratto.json [23:24:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T392806)', diff saved to https://phabricator.wikimedia.org/P75948 and previous config saved to /var/cache/conftool/dbconfig/20250512-232437-fceratto.json [23:24:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [23:25:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T392806)', diff saved to https://phabricator.wikimedia.org/P75949 and previous config saved to /var/cache/conftool/dbconfig/20250512-232504-fceratto.json [23:27:24] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 
0x7f8472ec4ed0: Failed to establish a new connection: [Errno 113 [23:27:24] te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:28:43] (03PS1) 10Eevans: cassandra-dev2001: properly assign data_file_directories for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1144675 (https://phabricator.wikimedia.org/T391544) [23:29:24] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 34, number_of_data_nodes: 34, discovered_master: True, active_primary_shards: 1708, active_shards: 5112, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 11, delayed_unassigned_shards [23:29:24] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.78528206129221 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:30:10] (03CR) 10Eevans: [C:03+2] cassandra-dev2001: properly assign data_file_directories for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1144675 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [23:31:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T392806)', diff saved to https://phabricator.wikimedia.org/P75950 and previous config saved to /var/cache/conftool/dbconfig/20250512-233142-fceratto.json [23:32:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1070-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:35:42] (03PS1) 10Eevans: cassandra-dev2001: JBOD: hints, commitlog, caches & 
heapdumps [puppet] - 10https://gerrit.wikimedia.org/r/1144678 (https://phabricator.wikimedia.org/T391544) [23:36:58] (03CR) 10Eevans: [C:03+2] cassandra-dev2001: JBOD: hints, commitlog, caches & heapdumps [puppet] - 10https://gerrit.wikimedia.org/r/1144678 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [23:38:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1144680 [23:38:28] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1144680 (owner: 10TrainBranchBot) [23:44:04] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2001.codfw.wmnet with OS bullseye [23:44:16] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10814527 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host cassandra-dev2001.codf... [23:46:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P75951 and previous config saved to /var/cache/conftool/dbconfig/20250512-234650-fceratto.json [23:48:47] 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10814530 (10cmassaro) [23:49:58] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1144680 (owner: 10TrainBranchBot) [23:50:38] 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10814531 (10cmassaro) @Eevans That's correct! I will do so. @BCornwall Yes, apologies! 
I will need to do this one more time in a couple of weeks, when I receive the correct computer (I... [23:56:24] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7f3d1f180f10: Failed to establish a new connection: [Errno 113 [23:56:24] te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:57:24] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 34, number_of_data_nodes: 34, discovered_master: True, active_primary_shards: 1708, active_shards: 5112, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 11, delayed_unassigned_shards [23:57:24] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.78528206129221 https://wikitech.wikimedia.org/wiki/Search%23Administration