[00:09:48] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1139210
[00:09:48] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1139210 (owner: TrainBranchBot)
[00:11:06] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 658.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:11:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:21:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:34:01] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1139210 (owner: TrainBranchBot)
[00:40:56] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:42:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:46:49] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:49:25] FIRING: [4x] SystemdUnitFailed: backup-kdc-database.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:54:17] FIRING: JobQueueLowTrafficConsumerWidespreadHighLatency: ...
[00:54:18] Processing delay times for low-traffic consumer rules are unusually high - https://wikitech.wikimedia.org/wiki/MediaWiki_JobQueue/Operations#JobQueueLowTrafficConsumerWidespreadHighLatency - https://grafana.wikimedia.org/d/fe130675-0c2d-4991-9dec-f54cf6a9c4d8/jobqueue-low-traffic-jobs?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DJobQueueLowTrafficConsumerWidespreadHighLatency
[00:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[01:34:17] RESOLVED: JobQueueLowTrafficConsumerWidespreadHighLatency: ...
[01:34:18] Processing delay times for low-traffic consumer rules are unusually high - https://wikitech.wikimedia.org/wiki/MediaWiki_JobQueue/Operations#JobQueueLowTrafficConsumerWidespreadHighLatency - https://grafana.wikimedia.org/d/fe130675-0c2d-4991-9dec-f54cf6a9c4d8/jobqueue-low-traffic-jobs?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DJobQueueLowTrafficConsumerWidespreadHighLatency
[01:40:56] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:42:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:43:42] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:50:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:09:08] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:43:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[02:57:27] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[03:28:32] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:29:22] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[03:53:42] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:53:46] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:46:49] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:49:25] FIRING: [4x] SystemdUnitFailed: backup-kdc-database.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:57] (CR) Arnaudb: [C:+2] gerrit: convert robots.txt to a flat file [puppet] - https://gerrit.wikimedia.org/r/1138330 (https://phabricator.wikimedia.org/T392669) (owner: Hashar)
[05:20:08] (PS1) KartikMistry: Update cxserver to 2025-04-25-063512-production [deployment-charts] - https://gerrit.wikimedia.org/r/1139214 (https://phabricator.wikimedia.org/T392662)
[05:28:08] (PS1) Marostegui: es1029: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139215 (https://phabricator.wikimedia.org/T391921)
[05:28:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1029 T391921', diff saved to https://phabricator.wikimedia.org/P75466 and previous config saved to /var/cache/conftool/dbconfig/20250428-052817-marostegui.json
[05:28:23] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[05:28:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1032 and es2028 T391921', diff saved to https://phabricator.wikimedia.org/P75467 and previous config saved to /var/cache/conftool/dbconfig/20250428-052836-marostegui.json
[05:29:03] (CR) Arnaudb: gerrit: switchover to gerrit1003 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[05:29:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1029.eqiad.wmnet with reason: Maintenance
[05:29:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: Maintenance
[05:30:29] (CR) Marostegui: [C:+2] es1029: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139215 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[05:32:51] (PS1) Marostegui: es2028: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139216 (https://phabricator.wikimedia.org/T391921)
[05:33:18] (CR) Marostegui: [C:+2] es2028: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139216 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[05:35:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75468 and previous config saved to /var/cache/conftool/dbconfig/20250428-053532-root.json
[05:38:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75469 and previous config saved to /var/cache/conftool/dbconfig/20250428-053829-root.json
[05:42:07] !log Migrate es1029 and es2028 to MariaDB 10.11 T391921
[05:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:12] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[05:43:42] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:47:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1030 and es2026 T391921', diff saved to https://phabricator.wikimedia.org/P75470 and previous config saved to /var/cache/conftool/dbconfig/20250428-054741-marostegui.json
[05:47:47] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[05:48:28] (PS1) Marostegui: es1030: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139218 (https://phabricator.wikimedia.org/T391921)
[05:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:48:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2026.codfw.wmnet,es1030.eqiad.wmnet with reason: Maintenance
[05:49:34] (CR) Marostegui: [C:+2] es1030: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139218 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[05:50:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75471 and previous config saved to /var/cache/conftool/dbconfig/20250428-055038-root.json
[05:53:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75472 and previous config saved to /var/cache/conftool/dbconfig/20250428-055335-root.json
[05:54:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75473 and previous config saved to /var/cache/conftool/dbconfig/20250428-055411-root.json
[06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:05:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75474 and previous config saved to /var/cache/conftool/dbconfig/20250428-060543-root.json
[06:08:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75476 and previous config saved to /var/cache/conftool/dbconfig/20250428-060840-root.json
[06:09:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75477 and previous config saved to /var/cache/conftool/dbconfig/20250428-060916-root.json
[06:16:03] Deploying cxserver.
[06:16:33] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2025-04-25-063512-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139214 (https://phabricator.wikimedia.org/T392662) (owner: 10KartikMistry) [06:18:17] (03Merged) 10jenkins-bot: Update cxserver to 2025-04-25-063512-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139214 (https://phabricator.wikimedia.org/T392662) (owner: 10KartikMistry) [06:20:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75480 and previous config saved to /var/cache/conftool/dbconfig/20250428-062049-root.json [06:22:24] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply [06:22:48] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply [06:23:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75481 and previous config saved to /var/cache/conftool/dbconfig/20250428-062346-root.json [06:24:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75482 and previous config saved to /var/cache/conftool/dbconfig/20250428-062422-root.json [06:25:19] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:25:52] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:26:51] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [06:27:26] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:29:11] !log Updated cxserver to 2025-04-25-063512-production (T392662) [06:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:16] T392662: /v2/suggest/sections/{title}/{from}/{to}: Error in your SQL syntax; check for the right syntax to use near 
') - https://phabricator.wikimedia.org/T392662 [06:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:35:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75484 and previous config saved to /var/cache/conftool/dbconfig/20250428-063555-root.json [06:38:11] (03CR) 10Jelto: [C:03+2] gitlab: disable ci_secure_files object storage [puppet] - 10https://gerrit.wikimedia.org/r/1139007 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [06:38:16] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1139120 (https://phabricator.wikimedia.org/T392628) (owner: 10JHathaway) [06:38:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75485 and previous config saved to /var/cache/conftool/dbconfig/20250428-063851-root.json [06:39:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75486 and previous config saved to /var/cache/conftool/dbconfig/20250428-063927-root.json [06:39:44] (03PS1) 10Marostegui: es2026: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1139309 (https://phabricator.wikimedia.org/T391921) [06:40:24] (03CR) 10Marostegui: [C:03+2] es2026: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1139309 (https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui) [06:43:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - 
https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [06:43:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75487 and previous config saved to /var/cache/conftool/dbconfig/20250428-064356-root.json [06:50:30] (03CR) 10Muehlenhoff: [C:03+2] Add trixie to pbuilder setup [puppet] - 10https://gerrit.wikimedia.org/r/1139037 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff) [06:51:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75489 and previous config saved to /var/cache/conftool/dbconfig/20250428-065100-root.json [06:52:37] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1139016 (owner: 10Majavah) [06:53:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75490 and previous config saved to /var/cache/conftool/dbconfig/20250428-065357-root.json [06:54:19] (03PS1) 10Filippo Giunchedi: statistics: add statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1139310 (https://phabricator.wikimedia.org/T392599) [06:54:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75491 and previous config saved to /var/cache/conftool/dbconfig/20250428-065433-root.json [06:55:53] (03CR) 10Elukey: [C:03+2] profile::prometheus::k8s: drop istio gateway labels for ML [puppet] - 10https://gerrit.wikimedia.org/r/1138313 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [06:56:25] (03PS2) 10Elukey: profile::pyrra: avoid Istio recording rules for SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1138674 
(https://phabricator.wikimedia.org/T387350) [06:57:27] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:59:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75493 and previous config saved to /var/cache/conftool/dbconfig/20250428-065901-root.json [07:00:04] Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. [07:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [07:06:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75494 and previous config saved to /var/cache/conftool/dbconfig/20250428-070606-root.json [07:09:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75495 and previous config saved to /var/cache/conftool/dbconfig/20250428-070902-root.json [07:09:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75496 and previous config saved to /var/cache/conftool/dbconfig/20250428-070939-root.json [07:12:28] 06SRE, 
06Infrastructure-Foundations: Use a forward port of Puppet 7 on Trixie hosts - https://phabricator.wikimedia.org/T392790 (10MoritzMuehlenhoff) 03NEW [07:14:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75497 and previous config saved to /var/cache/conftool/dbconfig/20250428-071408-root.json [07:18:51] (03CR) 10Filippo Giunchedi: [C:03+2] role: remove logstash role files [puppet] - 10https://gerrit.wikimedia.org/r/1138756 (owner: 10Filippo Giunchedi) [07:19:51] (03CR) 10Majavah: [C:03+2] hieradata: Expand GitLab blocklist for new WMCS IP space [puppet] - 10https://gerrit.wikimedia.org/r/1139016 (owner: 10Majavah) [07:21:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75498 and previous config saved to /var/cache/conftool/dbconfig/20250428-072111-root.json [07:24:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75499 and previous config saved to /var/cache/conftool/dbconfig/20250428-072408-root.json [07:24:18] (03CR) 10Filippo Giunchedi: "You are right, I was too hasty on this and jumped the gun on this. 
I'll abandon the review for now and we can keep iterating on T391687 wh" [puppet] - 10https://gerrit.wikimedia.org/r/1138754 (https://phabricator.wikimedia.org/T391687) (owner: 10Filippo Giunchedi) [07:24:26] (03Abandoned) 10Filippo Giunchedi: logstash: bump shards for logstash-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138754 (https://phabricator.wikimedia.org/T391687) (owner: 10Filippo Giunchedi) [07:24:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75500 and previous config saved to /var/cache/conftool/dbconfig/20250428-072444-root.json [07:25:48] (03PS1) 10Muehlenhoff: Add component/puppet7 for trixie-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1139314 (https://phabricator.wikimedia.org/T392790) [07:29:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75501 and previous config saved to /var/cache/conftool/dbconfig/20250428-072914-root.json [07:29:17] !log upgrade thanos to 0.38 on titan1* - T383966 [07:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:21] T383966: Upgrade Thanos to 0.38.0 - https://phabricator.wikimedia.org/T383966 [07:33:29] (03PS1) 10Marostegui: instance.schema: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1139350 (https://phabricator.wikimedia.org/T390530) [07:34:47] !log upgrade thanos to 0.38 on titan2* - T383966 [07:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:51] T383966: Upgrade Thanos to 0.38.0 - https://phabricator.wikimedia.org/T383966 [07:36:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75502 and previous config saved to /var/cache/conftool/dbconfig/20250428-073617-root.json [07:37:51] (03PS1) 10Majavah: hieradata: Update Gerrit IPs in dmz_cidr [puppet] 
- 10https://gerrit.wikimedia.org/r/1139403 (https://phabricator.wikimedia.org/T392793) [07:39:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75503 and previous config saved to /var/cache/conftool/dbconfig/20250428-073914-root.json [07:39:40] (03PS1) 10Majavah: cr-cloud: Update Gerrit addressing [homer/public] - 10https://gerrit.wikimedia.org/r/1139406 (https://phabricator.wikimedia.org/T392793) [07:39:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75504 and previous config saved to /var/cache/conftool/dbconfig/20250428-073950-root.json [07:43:46] (03PS1) 10Elukey: sre.hosts.move-vlan: improve grep reports when renaming [cookbooks] - 10https://gerrit.wikimedia.org/r/1139407 (https://phabricator.wikimedia.org/T392729) [07:44:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75506 and previous config saved to /var/cache/conftool/dbconfig/20250428-074419-root.json [07:46:34] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:48:24] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:48:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [07:53:42] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:53:46] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:53:51] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [07:54:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75508 and previous config saved to /var/cache/conftool/dbconfig/20250428-075455-root.json [07:56:27] (03CR) 10Elukey: [C:03+1] Add component/puppet7 for trixie-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1139314 (https://phabricator.wikimedia.org/T392790) (owner: 10Muehlenhoff) [07:59:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75509 and previous config saved to /var/cache/conftool/dbconfig/20250428-075924-root.json [07:59:47] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10771657 (10MoritzMuehlenhoff) [08:01:41] (03PS1) 10Ayounsi: gNMIc: collect transceivers states [puppet] - 10https://gerrit.wikimedia.org/r/1139410 (https://phabricator.wikimedia.org/T388641) [08:01:43] (03PS1) 10Ayounsi: Fastnetmon: permanently disable graphite [puppet] - 10https://gerrit.wikimedia.org/r/1139411 (https://phabricator.wikimedia.org/T228380) [08:02:05] (03PS2) 10Ayounsi: 
Fastnetmon: permanently disable graphite [puppet] - 10https://gerrit.wikimedia.org/r/1139411 (https://phabricator.wikimedia.org/T228380) [08:02:18] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1139411 (https://phabricator.wikimedia.org/T228380) (owner: 10Ayounsi) [08:06:08] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:06:58] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:08:03] (03CR) 10David Caro: [C:03+1] "LGTM, the old ip points to:" [puppet] - 10https://gerrit.wikimedia.org/r/1139403 (https://phabricator.wikimedia.org/T392793) (owner: 10Majavah) [08:08:59] (03CR) 10David Caro: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1139406 (https://phabricator.wikimedia.org/T392793) (owner: 10Majavah) [08:11:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:12:02] (03CR) 10Muehlenhoff: [C:03+2] Add component/puppet7 for trixie-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1139314 (https://phabricator.wikimedia.org/T392790) (owner: 10Muehlenhoff) [08:12:05] (03CR) 10Filippo Giunchedi: [C:03+1] profile::pyrra: avoid Istio recording rules for SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1138674 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [08:12:26] (03CR) 10Elukey: [C:03+2] profile::pyrra: avoid Istio recording rules for SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1138674 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [08:14:30] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75510 and previous config saved to /var/cache/conftool/dbconfig/20250428-081430-root.json [08:14:59] !log installing Linux 6.1.135 on Bookworm hosts [08:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:20:24] jouncebot: nowandnext [08:20:24] No deployments scheduled for the next 1 hour(s) and 39 minute(s) [08:20:24] In 1 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1000) [08:21:08] (CR) TrainBranchBot: [C:+2] "Approved by taavi@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137732 (https://phabricator.wikimedia.org/T386689) (owner: Majavah) [08:22:46] (Merged) jenkins-bot: Add WMCS ranges to wgAutoblockExemptions [mediawiki-config] - https://gerrit.wikimedia.org/r/1137732 (https://phabricator.wikimedia.org/T386689) (owner: Majavah) [08:23:33] !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1137732|Add WMCS ranges to wgAutoblockExemptions (T386689)]] [08:23:37] T386689: Add new WMCS IPv6 ranges to MediaWiki configuration where required - https://phabricator.wikimedia.org/T386689 [08:23:53] ops-magru, DC-Ops, Observability-Metrics, Patch-For-Review, SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10771693 (tappof) Hey @wiki_willy, thanks for the feedback! I'll take a look at your request and let you know.
[08:29:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75511 and previous config saved to /var/cache/conftool/dbconfig/20250428-082935-root.json [08:30:11] (CR) Vgutierrez: [C:+2] puppetserver: Deploy wmfuniq-keygen package [puppet] - https://gerrit.wikimedia.org/r/1138845 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez) [08:37:10] (CR) Ayounsi: [C:+1] cr-cloud: Update Gerrit addressing [homer/public] - https://gerrit.wikimedia.org/r/1139406 (https://phabricator.wikimedia.org/T392793) (owner: Majavah) [08:37:18] (CR) Ayounsi: [C:+1] "thx!" [homer/public] - https://gerrit.wikimedia.org/r/1139406 (https://phabricator.wikimedia.org/T392793) (owner: Majavah) [08:38:37] !log taavi@deploy1003 taavi: Backport for [[gerrit:1137732|Add WMCS ranges to wgAutoblockExemptions (T386689)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:38:43] T386689: Add new WMCS IPv6 ranges to MediaWiki configuration where required - https://phabricator.wikimedia.org/T386689 [08:39:52] !log taavi@deploy1003 taavi: Continuing with sync [08:43:40] (CR) Btullis: [C:+1] "Cool."
[deployment-charts] - https://gerrit.wikimedia.org/r/1138827 (https://phabricator.wikimedia.org/T391348) (owner: Brouberol) [08:44:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75513 and previous config saved to /var/cache/conftool/dbconfig/20250428-084440-root.json [08:46:49] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:47:31] !log installing avahi security updates [08:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:31] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on krb1002.eqiad.wmnet with reason: work in progress, not yet active [08:49:20] !log taavi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137732|Add WMCS ranges to wgAutoblockExemptions (T386689)]] (duration: 25m 46s) [08:49:24] T386689: Add new WMCS IPv6 ranges to MediaWiki configuration where required - https://phabricator.wikimedia.org/T386689 [08:49:51] (CR) Majavah: [C:+2] cr-cloud: Update Gerrit addressing [homer/public] - https://gerrit.wikimedia.org/r/1139406 (https://phabricator.wikimedia.org/T392793) (owner: Majavah) [08:51:16] (Merged) jenkins-bot: cr-cloud: Update Gerrit addressing [homer/public] - https://gerrit.wikimedia.org/r/1139406 (https://phabricator.wikimedia.org/T392793) (owner: Majavah) [08:54:30] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796 (MatthewVernon) NEW [08:55:11] !log update cr-cloud firewall policy for new gerrit ip address T392793 [08:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:15] (CR) Hnowlan:
[C:+2] mw:maintenance: migrate mediamoderation-updateMetrics to k8s [puppet] - https://gerrit.wikimedia.org/r/1139080 (https://phabricator.wikimedia.org/T385799) (owner: Hnowlan) [08:55:27] jouncebot: nowandnext [08:55:27] No deployments scheduled for the next 1 hour(s) and 4 minute(s) [08:55:27] In 1 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1000) [08:57:09] (CR) Majavah: [C:+2] hieradata: Update Gerrit IPs in dmz_cidr [puppet] - https://gerrit.wikimedia.org/r/1139403 (https://phabricator.wikimedia.org/T392793) (owner: Majavah) [08:58:36] !log restarting blazegraph on wdqs1013 (deadlocked) [08:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:34] (CR) FNegri: [C:+1] P:wmcs: maintain_dbusers: Use cloud-private for ToolsDB [puppet] - https://gerrit.wikimedia.org/r/1139019 (https://phabricator.wikimedia.org/T381272) (owner: Majavah) [09:00:40] (CR) FNegri: [C:+1] P:toolforge: disable_tool: Use ToolsDB internal IP instead [puppet] - https://gerrit.wikimedia.org/r/1139018 (https://phabricator.wikimedia.org/T381272) (owner: Majavah) [09:01:07] (PS1) Filippo Giunchedi: thanos: raise rule log level to avoid log spam [puppet] - https://gerrit.wikimedia.org/r/1139414 (https://phabricator.wikimedia.org/T383966) [09:01:09] (CR) Majavah: [C:+2] P:toolforge: disable_tool: Use ToolsDB internal IP instead [puppet] - https://gerrit.wikimedia.org/r/1139018 (https://phabricator.wikimedia.org/T381272) (owner: Majavah) [09:02:28] RESOLVED: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:02:30] !log hnowlan@deploy1003 helmfile
[eqiad] START helmfile.d/services/mw-cron: apply [09:02:35] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [09:02:48] (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5368/co" [puppet] - https://gerrit.wikimedia.org/r/1139414 (https://phabricator.wikimedia.org/T383966) (owner: Filippo Giunchedi) [09:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:04:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:08:21] (PS1) Hnowlan: mw::maintenance: migrate mediamoderation-hourlyScan to k8s [puppet] - https://gerrit.wikimedia.org/r/1139415 (https://phabricator.wikimedia.org/T385799) [09:08:45] (CR) CI reject: [V:-1] mw::maintenance: migrate mediamoderation-hourlyScan to k8s [puppet] - https://gerrit.wikimedia.org/r/1139415 (https://phabricator.wikimedia.org/T385799) (owner: Hnowlan) [09:09:59] (PS2) Hnowlan: mw::maintenance: migrate mediamoderation-hourlyScan to k8s [puppet] - https://gerrit.wikimedia.org/r/1139415 (https://phabricator.wikimedia.org/T385799) [09:10:08] (CR) Filippo Giunchedi: [C:+1] "LGTM, nice!"
[puppet] - https://gerrit.wikimedia.org/r/1139411 (https://phabricator.wikimedia.org/T228380) (owner: Ayounsi) [09:10:19] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10771815 (MatthewVernon) p:Triage→High [09:10:57] (CR) Filippo Giunchedi: [C:+2] envoyproxy: tweak default histogram buckets [puppet] - https://gerrit.wikimedia.org/r/1138329 (https://phabricator.wikimedia.org/T391333) (owner: Filippo Giunchedi) [09:15:10] (PS1) Majavah: P:toolforge: Apply admin-root sudo policy to all instances [puppet] - https://gerrit.wikimedia.org/r/1139416 (https://phabricator.wikimedia.org/T392797) [09:17:17] (CR) Hnowlan: [C:+2] mediawiki: migrate unsubscribeinactiveusers-metawiki to k8s [puppet] - https://gerrit.wikimedia.org/r/1138810 (https://phabricator.wikimedia.org/T388539) (owner: Hnowlan) [09:20:44] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:20:44] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs.
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:21:44] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:21:44] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:22:15] (PS1) Samtar: errorpage.html.erb: Use flex for page layout [puppet] - https://gerrit.wikimedia.org/r/1139049 (https://phabricator.wikimedia.org/T392692) [09:25:29] SRE, [Archived]Wikidata Dev Team, Prod-Kubernetes, Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10771849 (Silvan_WMDE) @Kirilloparma We have created and merged a patch that will ho... [09:27:07] SRE, Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10771861 (elukey) @herron now the citoid definition uses "raw" istio metrics, and from https://thanos.wikimedia.org/rules it seems that we are ranging at aro... [09:28:02] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [09:28:08] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [09:30:02] (Abandoned) Hnowlan: Revert "debug: reorder debug backends for eqiad switchover" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129297 (owner: Hnowlan) [09:34:12] (CR) Lucas Werkmeister (WMDE): [C:+1] "Looks good to me insofar as I understand it (very little ^^).
Do we need to configure Prometheus to pull from this new exporter or will th" [puppet] - https://gerrit.wikimedia.org/r/1139310 (https://phabricator.wikimedia.org/T392599) (owner: Filippo Giunchedi) [09:34:55] (CR) Filippo Giunchedi: [V:+1 C:+2] thanos: raise rule log level to avoid log spam [puppet] - https://gerrit.wikimedia.org/r/1139414 (https://phabricator.wikimedia.org/T383966) (owner: Filippo Giunchedi) [09:37:28] (CR) Hasan Akgün (WMDE): "Same here, imo it's not a blocker for this patch to process but still something we should consider" [puppet] - https://gerrit.wikimedia.org/r/1139310 (https://phabricator.wikimedia.org/T392599) (owner: Filippo Giunchedi) [09:38:42] FIRING: [8x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:38:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:40:20] (Abandoned) Hnowlan: trafficserver: remove restbase from citoid request path everywhere [puppet] - https://gerrit.wikimedia.org/r/1124418 (https://phabricator.wikimedia.org/T361576) (owner: Hnowlan) [09:41:51] (CR) Dreamy Jazz: [C:+1] "Looks good from a TSP point of view."
[puppet] - https://gerrit.wikimedia.org/r/1139415 (https://phabricator.wikimedia.org/T385799) (owner: Hnowlan) [09:42:32] (PS2) Hnowlan: trafficserver: route all but zhwiki PCS routes via the rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1138692 (https://phabricator.wikimedia.org/T390724) [09:43:42] FIRING: [8x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:43:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:46:24] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis nupwiki in section s5 [09:47:45] (Abandoned) Hnowlan: mediawiki: miscellaneous bits of jobrunner cleanup [puppet] - https://gerrit.wikimedia.org/r/1117525 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan) [09:48:30] (CR) Majavah: [C:+2] P:wmcs: maintain_dbusers: Use cloud-private for ToolsDB [puppet] - https://gerrit.wikimedia.org/r/1139019 (https://phabricator.wikimedia.org/T381272) (owner: Majavah) [09:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:50:34] ops-eqiad, SRE, collaboration-services, DC-Ops: eno1 on gitlab-runner1003:9100 has the wrong speed: 1.25e+07.
- https://phabricator.wikimedia.org/T392585#10771922 (LSobanski) [09:51:10] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Checking sanitization for wikis nupwiki in section s5 [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:53:34] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2002.codfw.wmnet [09:53:53] SRE, Kubernetes: Increase vcpus on K8s control plane VMs - https://phabricator.wikimedia.org/T392289#10771944 (ops-monitoring-bot) VM ml-staging-ctrl2002.codfw.wmnet rebooted by elukey@cumin1002 with reason: Increase vcores and memory [09:54:03] !log increase vcores and memory available for ml-staging-ctrl2* - T392289#10771944 [09:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:07] T392289: Increase vcpus on K8s control plane VMs - https://phabricator.wikimedia.org/T392289 [09:55:12] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis nupwiki in section s5 [09:56:23] (PS1) Hnowlan: mw::maintenance: migrate purge_parsercache_pc1 to k8s [puppet] - https://gerrit.wikimedia.org/r/1139422 (https://phabricator.wikimedia.org/T385800) [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:56:46] (CR) CI reject: [V:-1]
mw::maintenance: migrate purge_parsercache_pc1 to k8s [puppet] - https://gerrit.wikimedia.org/r/1139422 (https://phabricator.wikimedia.org/T385800) (owner: Hnowlan) [09:57:37] (PS2) Hnowlan: mw::maintenance: migrate purge_parsercache_pc1 to k8s [puppet] - https://gerrit.wikimedia.org/r/1139422 (https://phabricator.wikimedia.org/T385800) [09:58:31] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2002.codfw.wmnet [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1000) [10:01:59] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2001.codfw.wmnet [10:02:21] SRE, Kubernetes: Increase vcpus on K8s control plane VMs - https://phabricator.wikimedia.org/T392289#10771964 (ops-monitoring-bot) VM ml-staging-ctrl2001.codfw.wmnet rebooted by elukey@cumin1002 with reason: Increase vcores and memory [10:03:41] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis nupwiki in section s5 [10:04:40] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis nupwiki in section s5 [10:04:51] (PS1) Hnowlan: mw::maintenance: move remaining pagetriage jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139424 (https://phabricator.wikimedia.org/T388536) [10:06:55] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2001.codfw.wmnet [10:09:13] fceratto@cumin1002 sanitize-wiki (PID 3414639) is awaiting input [10:10:48] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis nupwiki in section s5 [10:18:46] (PS3) Hashar: gerrit: prevent crawling of some URLs [puppet] - https://gerrit.wikimedia.org/r/1138331 (https://phabricator.wikimedia.org/T392669) [10:19:52] (CR) Hashar: "Rebased
due to the parent change ( Ib2302cc1ff7b49f58bac0eab8eea7c1fe68e62ea" [puppet] - https://gerrit.wikimedia.org/r/1138331 (https://phabricator.wikimedia.org/T392669) (owner: Hashar) [10:19:53] (PS1) Lucas Werkmeister (WMDE): Migrate MediaWiki.wikibase.* stats [extensions/Wikibase] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139426 (https://phabricator.wikimedia.org/T359251) [10:20:16] (PS1) Lucas Werkmeister (WMDE): Migrate MediaWiki.$prefix.wikibase.client.scribunto.* to statslib [extensions/Wikibase] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139427 (https://phabricator.wikimedia.org/T359253) [10:20:42] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Wikibase] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139426 (https://phabricator.wikimedia.org/T359251) (owner: Lucas Werkmeister (WMDE)) [10:20:50] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Wikibase] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139427 (https://phabricator.wikimedia.org/T359253) (owner: Lucas Werkmeister (WMDE)) [10:23:35] (PS1) Hnowlan: mw::maintenance::campaignevents: migrate remaining updateutcts jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139428 (https://phabricator.wikimedia.org/T385867) [10:24:52] (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139422 (https://phabricator.wikimedia.org/T385800) (owner: Hnowlan) [10:24:57] (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139424 (https://phabricator.wikimedia.org/T388536) (owner: Hnowlan) [10:25:12] (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139428
(https://phabricator.wikimedia.org/T385867) (owner: Hnowlan) [10:25:13] (PS1) Muehlenhoff: kernel_report: Fix typo [puppet] - https://gerrit.wikimedia.org/r/1139429 [10:26:30] (CR) Hashar: "I had the search console manually enabled to get access to the Google crawling dashboard and then attempt to fine tune what it is crawling" [dns] - https://gerrit.wikimedia.org/r/1138996 (https://phabricator.wikimedia.org/T392669) (owner: Hashar) [10:32:27] jouncebot: nowandnext [10:32:27] For the next 0 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1000) [10:32:27] In 2 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1300) [10:32:37] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis nupwiki in section s5 [10:34:46] (PS5) Samtar: InitialiseSettings: wgTemplateDataEnableDiscovery on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1134771 (https://phabricator.wikimedia.org/T377975) [10:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:35:48] (PS1) Lucas Werkmeister (WMDE): statistics::wmde: Configure statsd_exporter [puppet] - https://gerrit.wikimedia.org/r/1139431 (https://phabricator.wikimedia.org/T389344) [10:36:16] (CR) Lucas Werkmeister (WMDE): "I hope this is the right host and port…" [puppet] - https://gerrit.wikimedia.org/r/1139431 (https://phabricator.wikimedia.org/T389344) (owner: Lucas Werkmeister (WMDE)) [10:40:34] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis nupwiki in section s5 [10:41:04] (CR)
TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1134771 (https://phabricator.wikimedia.org/T377975) (owner: Samtar) [10:42:39] (Merged) jenkins-bot: InitialiseSettings: wgTemplateDataEnableDiscovery on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1134771 (https://phabricator.wikimedia.org/T377975) (owner: Samtar) [10:42:53] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1134771|InitialiseSettings: wgTemplateDataEnableDiscovery on testwiki (T377975)]] [10:42:58] T377975: Enable template favouriting on Beta, pilot wikis, and test - https://phabricator.wikimedia.org/T377975 [10:43:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [10:44:13] (PS1) Hnowlan: mw:maintenance:updatequerypages: move all ancientpages jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139432 (https://phabricator.wikimedia.org/T388534) [10:44:40] (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139432 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan) [10:46:33] (PS2) Hnowlan: mw:maintenance:updatequerypages: move all deadendpages jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139432 (https://phabricator.wikimedia.org/T388534) [10:47:30] (PS1) Awight: Revert "Temporarily revoke ssh key for travel" [puppet] - https://gerrit.wikimedia.org/r/1139434 [10:47:40] !log samtar@deploy1003 samtar: Backport for [[gerrit:1134771|InitialiseSettings: wgTemplateDataEnableDiscovery on testwiki (T377975)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:48:31] !log samtar@deploy1003 samtar:
Continuing with sync [10:48:49] (CR) Elukey: [C:+1] kernel_report: Fix typo [puppet] - https://gerrit.wikimedia.org/r/1139429 (owner: Muehlenhoff) [10:53:29] (PS1) Ozge: feat: adds articlequality_v2 [deployment-charts] - https://gerrit.wikimedia.org/r/1139436 [10:55:02] (CR) Ozge: [V:+2 C:+2] feat: adds articlequality_v2 [deployment-charts] - https://gerrit.wikimedia.org/r/1139436 (owner: Ozge) [10:55:12] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134771|InitialiseSettings: wgTemplateDataEnableDiscovery on testwiki (T377975)]] (duration: 12m 18s) [10:55:16] T377975: Enable template favouriting on Beta, pilot wikis, and test - https://phabricator.wikimedia.org/T377975 [10:56:50] (Merged) jenkins-bot: feat: adds articlequality_v2 [deployment-charts] - https://gerrit.wikimedia.org/r/1139436 (owner: Ozge) [10:58:47] (PS1) Hnowlan: mw::maintenance: migrate a single updatequerypages_ancientpages shard [puppet] - https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534) [10:58:49] (PS1) Hnowlan: mw:maintenance: migrate all updatequerypages_ancientpages jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139438 (https://phabricator.wikimedia.org/T388534) [10:59:14] (CR) CI reject: [V:-1] mw::maintenance: migrate a single updatequerypages_ancientpages shard [puppet] - https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan) [11:00:15] (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139432 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan) [11:00:26] (CR) Hnowlan: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan) [11:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 -
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:06:33] (PS2) Hnowlan: mw::maintenance: migrate a single updatequerypages_ancientpages shard [puppet] - https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534) [11:08:46] SRE, [Archived]Wikidata Dev Team, Prod-Kubernetes, Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10772066 (Silvan_WMDE) Until then: as noted above, the problem is not actually cause... [11:12:23] (PS3) Hnowlan: mw::maintenance: migrate a single updatequerypages_ancientpages shard [puppet] - https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534) [11:13:07] (PS1) Ladsgroup: EventStore: Add caching for per-page event lookups [extensions/CampaignEvents] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139439 (https://phabricator.wikimedia.org/T392784) [11:17:15] (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan) [11:22:55] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139038 (owner: Jforrester) [11:22:57] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139038 (owner: Jforrester) [11:23:12] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC afternoon backport
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139039 (https://phabricator.wikimedia.org/T376827) (owner: Jforrester) [11:23:25] (CR) Kamila Součková: [C:+1] mediawiki::maintenance: migrate main startupregistrystats job to k8s [puppet] - https://gerrit.wikimedia.org/r/1139020 (https://phabricator.wikimedia.org/T388540) (owner: Hnowlan) [11:23:25] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139047 (https://phabricator.wikimedia.org/T390384) (owner: Jforrester) [11:25:19] jouncebot: nowandnext [11:25:19] No deployments scheduled for the next 1 hour(s) and 34 minute(s) [11:25:19] In 1 hour(s) and 34 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1300) [11:25:33] (CR) Ladsgroup: [C:+2] EventStore: Add caching for per-page event lookups [extensions/CampaignEvents] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139439 (https://phabricator.wikimedia.org/T392784) (owner: Ladsgroup) [11:26:06] (PS1) Majavah: P:toolforge: disable_tool: Don't log diffs with secrets [puppet] - https://gerrit.wikimedia.org/r/1139443 [11:27:07] (Merged) jenkins-bot: EventStore: Add caching for per-page event lookups [extensions/CampaignEvents] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139439 (https://phabricator.wikimedia.org/T392784) (owner: Ladsgroup) [11:28:42] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1139439|EventStore: Add caching for per-page event lookups (T392784)]] [11:28:47] T392784: CampaignEvents makes an uncached x1 DB query on pageviews - https://phabricator.wikimedia.org/T392784 [11:28:48] SRE, SRE-swift-storage, Ceph, collaboration-services, and 3 others: Migrate
gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10772138 (Jelto) I granted `gitlab-ro` read-only access to the GitLab object storage buckets `gitlab-packages` and `gitlab-artifa... [11:29:19] (CR) Kamila Součková: "migration_title bad copypasta, LGTM otherwise" [puppet] - https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) (owner: Hnowlan) [11:30:08] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 267372 [11:30:09] !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [11:30:21] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 267372 [11:30:37] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 264195 [11:30:55] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 264195 [11:30:56] (CR) Kamila Součková: [C:+1] mw::maintenance: migrate deleteExpiredUserImpactData to k8s [puppet] - https://gerrit.wikimedia.org/r/1136770 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan) [11:31:10] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 61622 [11:31:25] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61622 [11:31:28] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 264544 [11:31:44] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1138443 (https://phabricator.wikimedia.org/T392026) (owner: Jforrester) [11:31:52] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0)
with action 'email' for AS: 264544 [11:31:57] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 17072 [11:32:17] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 17072 [11:32:23] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 270589 [11:32:25] (03PS4) 10Hnowlan: mediawiki: migrate all unsubscribeinactiveusers jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) [11:32:35] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 270589 [11:32:39] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 274607 [11:33:20] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1139439|EventStore: Add caching for per-page event lookups (T392784)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:33:40] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 274607 [11:33:42] (03CR) 10Kamila Součková: [C:03+1] mw::maintenance: migrate mediamoderation-hourlyScan to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139415 (https://phabricator.wikimedia.org/T385799) (owner: 10Hnowlan) [11:34:52] (03CR) 10Hashar: [C:03+1] "Oh perfect thank you very much! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) (owner: 10Hashar) [11:35:24] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [11:38:21] (03PS5) 10Hnowlan: mediawiki: migrate all unsubscribeinactiveusers jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) [11:40:57] (03CR) 10Kamila Součková: [C:03+1] mw::maintenance: migrate purge_parsercache_pc1 to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139422 (https://phabricator.wikimedia.org/T385800) (owner: 10Hnowlan) [11:41:23] (03CR) 10Ayounsi: [C:03+2] Fastnetmon: permanently disable graphite [puppet] - 10https://gerrit.wikimedia.org/r/1139411 (https://phabricator.wikimedia.org/T228380) (owner: 10Ayounsi) [11:41:26] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1139445 [11:41:57] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1139439|EventStore: Add caching for per-page event lookups (T392784)]] (duration: 13m 15s) [11:42:02] T392784: CampaignEvents makes an uncached x1 DB query on pageviews - https://phabricator.wikimedia.org/T392784 [11:42:32] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp1004.wikimedia.org [11:43:45] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate deleteExpiredUserImpactData to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1136770 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [11:43:57] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1139445 (owner: 10Hashar) [11:44:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:45:30] (03CR) 10Muehlenhoff: [C:03+2] kernel_report: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1139429 (owner: 10Muehlenhoff) [11:45:57] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit2003.wikimedia.org with reason: T392804 [11:47:34] !log push pfw policies - T392617 [11:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:51:40] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/460c1e7e3fee2d2e7ca4826011b5e66a4a6e79366c44ff434ebfa90fdadea433/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [11:52:20] !log installing avahi security updates [11:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:49] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [11:52:55] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [11:53:07] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp1004.wikimedia.org [11:53:42] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:53:51] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:56:54] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp1004.wikimedia.org [11:57:49] (03CR) 10Hnowlan: mediawiki: migrate all unsubscribeinactiveusers jobs to k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) (owner: 10Hnowlan) [11:58:30] 06SRE, 06cloud-services-team, 10Horizon, 06serviceops, 10Striker: Move cloudweb to Ganeti VMs and repurpose the servers as wikikube nodes - https://phabricator.wikimedia.org/T392478#10772235 (10taavi) /cc @Andrew Main thing to note here is that Horizon needs to be able to talk to cloud-realm services.... [11:58:55] (03CR) 10Hnowlan: [C:03+2] mediawiki::maintenance: migrate main startupregistrystats job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139020 (https://phabricator.wikimedia.org/T388540) (owner: 10Hnowlan) [11:59:11] (03PS1) 10Slyngshede: IDM/IDP: Patch management [dns] - 10https://gerrit.wikimedia.org/r/1139446 [12:01:20] (03CR) 10Slyngshede: [C:03+2] IDM/IDP: Patch management [dns] - 10https://gerrit.wikimedia.org/r/1139446 (owner: 10Slyngshede) [12:01:26] !log slyngshede@dns1004 START - running authdns-update [12:03:18] (03CR) 10Kamila Součková: [C:03+1] mw::maintenance: move remaining pagetriage jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139424 (https://phabricator.wikimedia.org/T388536) (owner: 10Hnowlan) [12:04:01] !log slyngshede@dns1004 END - running authdns-update [12:04:21] (03CR) 10Kamila Součková: [C:03+1] mw::maintenance::campaignevents: migrate remaining updateutcts jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139428 
(https://phabricator.wikimedia.org/T385867) (owner: 10Hnowlan) [12:07:09] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp2004.wikimedia.org [12:08:05] (03CR) 10Kamila Součková: [C:03+1] mw:maintenance:updatequerypages: move all deadendpages jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139432 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [12:08:45] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device pfw1a-codfw [12:09:42] (03CR) 10Kamila Součková: [C:03+1] mw::maintenance: migrate a single updatequerypages_ancientpages shard [puppet] - 10https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [12:10:54] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [12:11:00] (03CR) 10Kamila Součková: [C:03+1] mediawiki: migrate all unsubscribeinactiveusers jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) (owner: 10Hnowlan) [12:11:00] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [12:11:08] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp2004.wikimedia.org [12:11:09] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device pfw1a-codfw [12:11:24] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idm2001.wikimedia.org [12:11:40] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [12:13:25] !log installing sqlparse security updates [12:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:23] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on etherpad2002.codfw.wmnet with reason: 
T392804 [12:15:20] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm2001.wikimedia.org [12:15:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet [12:19:21] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lists2001.wikimedia.org with reason: T392804 [12:19:39] (03PS1) 10Majavah: P:wmcs: toolsdb_replica_cnf: Use consistent indentation [puppet] - 10https://gerrit.wikimedia.org/r/1139450 [12:19:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet [12:20:22] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Exclude legacy facts by default - https://phabricator.wikimedia.org/T372666#10772298 (10fgiunchedi) I was curious too how trixie + puppet 8 would look like and did some work in that direction, you can find the patches at `sandbox/filippo/pontoon-t... [12:21:17] !log repooling wdqs1013 [12:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:17] (03CR) 10Ladsgroup: [C:03+1] "We always forget this 😞" [puppet] - 10https://gerrit.wikimedia.org/r/1139350 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [12:23:14] (03CR) 10Marostegui: "I didn't forget it, but I prefer to do it in different CR" [puppet] - 10https://gerrit.wikimedia.org/r/1139350 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [12:24:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet [12:24:06] (03PS1) 10Majavah: P:toolforge: prometheus: Remove duplication in relabel configs [puppet] - 10https://gerrit.wikimedia.org/r/1139454 [12:24:06] (03PS1) 10Majavah: P:toolforge: prometheus: Use DNS names to look up scrape targets [puppet] - 10https://gerrit.wikimedia.org/r/1139455 (https://phabricator.wikimedia.org/T392570) [12:24:40] (03CR) 10Ladsgroup: [C:03+1] "I always do forget it 😄" 
[puppet] - 10https://gerrit.wikimedia.org/r/1139350 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [12:25:05] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5369/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139455 (https://phabricator.wikimedia.org/T392570) (owner: 10Majavah) [12:28:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet [12:32:30] (03PS1) 10Ozge: feat: updates blubber yaml for articlequality_v2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139460 [12:33:51] (03PS2) 10Ozge: feat: updates blubber yaml for articlequality_v2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139460 (https://phabricator.wikimedia.org/T391679) [12:34:52] (03CR) 10Ozge: [V:03+2 C:03+2] feat: updates blubber yaml for articlequality_v2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139460 (https://phabricator.wikimedia.org/T391679) (owner: 10Ozge) [12:37:12] !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:37:35] (03CR) 10Ladsgroup: [C:03+1] "Thanks! I can keep an eye on it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1139422 (https://phabricator.wikimedia.org/T385800) (owner: 10Hnowlan) [12:38:32] (03CR) 10Filippo Giunchedi: [C:03+2] statistics: add statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1139310 (https://phabricator.wikimedia.org/T392599) (owner: 10Filippo Giunchedi) [12:38:42] FIRING: [7x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:38:50] (03CR) 10Filippo Giunchedi: [C:03+2] "Prometheus will pick up metrics by itself, no need for "job" anymore" [puppet] - 10https://gerrit.wikimedia.org/r/1139310 (https://phabricator.wikimedia.org/T392599) (owner: 10Filippo Giunchedi) [12:41:00] (03CR) 10David Caro: [C:03+1] P:wmcs: toolsdb_replica_cnf: Use consistent indentation [puppet] - 10https://gerrit.wikimedia.org/r/1139450 (owner: 10Majavah) [12:41:16] (03CR) 10Majavah: [C:03+2] P:wmcs: toolsdb_replica_cnf: Use consistent indentation [puppet] - 10https://gerrit.wikimedia.org/r/1139450 (owner: 10Majavah) [12:41:36] 06SRE, 06Infrastructure-Foundations: Use a forward port of Puppet 7 on Trixie hosts - https://phabricator.wikimedia.org/T392790#10772373 (10MoritzMuehlenhoff) In addition to the puppet-agent forward port two more packages need to be built: - puppet agent 7 needs ruby-concurrent 1.1.x (since 1.2.x has breaking... 
[12:42:30] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10772374 (10MoritzMuehlenhoff) [12:43:41] (03CR) 10Filippo Giunchedi: [C:03+2] statistics::wmde: Configure statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1139431 (https://phabricator.wikimedia.org/T389344) (owner: 10Lucas Werkmeister (WMDE)) [12:43:42] FIRING: [7x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:43:44] PROBLEM - Webrequests Varnishkafka log producer on cp5026 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:43:47] !log installing werkzeug security updates [12:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:40] PROBLEM - Webrequests Varnishkafka log producer on cp5029 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:44:40] PROBLEM - Webrequests Varnishkafka log producer on cp5028 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:44:41] PROBLEM - Webrequests Varnishkafka log producer on cp5032 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:44:43] PROBLEM - Webrequests Varnishkafka log producer on cp5031 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:44:46] huh [12:44:57] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host maps-test2001.codfw.wmnet [12:45:42] PROBLEM - Webrequests Varnishkafka log producer on cp5025 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:45:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:45:46] yeah [12:45:48] !incidents [12:45:48] 6056 (UNACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [12:45:52] !ack 6056 [12:45:52] 6056 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [12:46:27] can I help with the incident sukhe ? expected ? [12:46:44] godog: no, not expected. a huge spike in upload@eqsin [12:46:52] looking as soon as superset loads for me :] [12:46:52] ack, checking too [12:48:42] FIRING: JobUnavailable: Reduced availability for job varnish-upload in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:49:53] (03PS2) 10AOkoth: miscweb: change os-reports runtime owner [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794) [12:50:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:51:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2001.codfw.wmnet [12:51:42] RECOVERY - Webrequests Varnishkafka log 
producer on cp5028 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:51:58] PROBLEM - Hadoop NodeManager on an-worker1163 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:53:42] RESOLVED: JobUnavailable: Reduced availability for job varnish-upload in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:54:46] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:57:33] !log test `host-inbound-traffic system-services` on pfw1-codfw - T390052 [12:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:37] T390052: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052 [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1300). [13:00:05] tgr, Lucas_WMDE, and James_F: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] o/ [13:00:19] are we okay to deploy right now? cc godog sukhe [13:00:38] (Hey.) 
[13:01:00] I'll be here in half an hour, can self-deploy [13:01:07] Lucas_WMDE: AFAICT yes, thanks for checking [13:01:11] ok thanks [13:01:16] I’ll start with my backports then [13:01:20] yep [13:01:39] and use spiderpig again just for the heck of it [13:02:10] Ooh, fancy. [13:02:23] Lucas_WMDE: Want to sling out my backport at the same time? It's a trivial logspam fix. [13:02:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1139426 (https://phabricator.wikimedia.org/T359251) (owner: 10Lucas Werkmeister (WMDE)) [13:02:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1139427 (https://phabricator.wikimedia.org/T359253) (owner: 10Lucas Werkmeister (WMDE)) [13:02:51] And "no" is a reasonable response, i can do it myself after you if you want. :-) [13:03:11] James_F: sorry, I already started it now [13:03:18] and I think I’d prefer to do it separately [13:03:19] I see. No worries. 
[13:03:21] but I can at least take a look at it now ^^ [13:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:04:20] RECOVERY - Webrequests Varnishkafka log producer on cp5025 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [13:04:20] RECOVERY - Webrequests Varnishkafka log producer on cp5026 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [13:05:20] RECOVERY - Webrequests Varnishkafka log producer on cp5029 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [13:06:37] !log clearing up Icinga alerts on cp50* [13:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:18] RECOVERY - Webrequests Varnishkafka log producer on cp5032 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [13:07:20] RECOVERY - Webrequests Varnishkafka log producer on cp5031 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [13:10:58] RECOVERY - Hadoop NodeManager on an-worker1163 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:13:10] (03PS5) 
10Federico Ceratto: sre.mysql.sanitize-wiki - handle multiple hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1139035 (https://phabricator.wikimedia.org/T366146) [13:16:13] (03PS3) 10Effie Mouzeli: Apache config for arbcom_plwiki [puppet] - 10https://gerrit.wikimedia.org/r/1133995 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [13:16:15] (03CR) 10Ladsgroup: [C:03+2] Apache config for arbcom_plwiki [puppet] - 10https://gerrit.wikimedia.org/r/1133995 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [13:16:26] (03CR) 10Hnowlan: [C:03+1] CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1139056 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [13:16:29] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Apache config for arbcom_plwiki [puppet] - 10https://gerrit.wikimedia.org/r/1133995 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [13:18:46] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:19:26] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) (owner: 10Hnowlan) [13:19:35] (03Merged) 10jenkins-bot: Migrate MediaWiki.wikibase.* stats [extensions/Wikibase] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1139426 (https://phabricator.wikimedia.org/T359251) (owner: 10Lucas Werkmeister (WMDE)) [13:19:39] (03Merged) 10jenkins-bot: Migrate MediaWiki.$prefix.wikibase.client.scribunto.* to statslib [extensions/Wikibase] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1139427 (https://phabricator.wikimedia.org/T359253) (owner: 10Lucas Werkmeister (WMDE)) [13:19:55] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for 
[[gerrit:1139426|Migrate MediaWiki.wikibase.* stats (T359251 T359252)]], [[gerrit:1139427|Migrate MediaWiki.$prefix.wikibase.client.scribunto.* to statslib (T359253)]] [13:20:01] T359251: [REPO][SW][GRAFMIGR] (mw.track) Migrate MediaWiki.wikibase.repo.* to statslib - https://phabricator.wikimedia.org/T359251 [13:20:02] T359252: [GRAFMIGR] Migrate MediaWiki.wikibase.view.* to statslib - https://phabricator.wikimedia.org/T359252 [13:20:02] T359253: [CLIENT][SW][GRAFMIGR] Migrate MediaWiki.$prefix.wikibase.client.scribunto.* to statslib - https://phabricator.wikimedia.org/T359253 [13:24:18] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1139426|Migrate MediaWiki.wikibase.* stats (T359251 T359252)]], [[gerrit:1139427|Migrate MediaWiki.$prefix.wikibase.client.scribunto.* to statslib (T359253)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:24:29] I can try to test a little bit [13:25:58] (03CR) 10Kamila Součková: [C:03+2] CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1139056 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [13:26:56] looks good! [13:27:03] I see something in https://thanos.wikimedia.org/graph?g0.expr=mediawiki_WikibaseRepo_EditEntity_attemptSave_duration_seconds_sum&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=1&g0.store_matches=%5B%5D&g0.engine=prometheus&g0.analyze=0&g0.tenant= [13:27:07] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync [13:27:31] James_F: want to start CI for your backport already? or do you want to do the config changes first? [13:29:58] Lucas_WMDE: Sure. 
[13:30:14] (03CR) 10Jforrester: [C:03+2] Fix: PHP Warning: Undefined array key "request" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1138443 (https://phabricator.wikimedia.org/T392026) (owner: 10Jforrester) [13:30:57] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough [13:31:28] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:31:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:31:36] ^ expected, reboots in progress [13:31:41] double so for the DNS ones starting soon [13:32:40] (03PS1) 10Esanders: Enable VE in Project (Wikipedia/Վիքիպեդիա) namespace at hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139470 (https://phabricator.wikimedia.org/T359815) [13:32:46] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox [13:32:46] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1004.wikimedia.org [13:32:48] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:33:28] (03CR) 10Hnowlan: [C:03+2] mediawiki: migrate all unsubscribeinactiveusers jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) (owner: 10Hnowlan) [13:33:48] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:33:48] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:33:48] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:33:48] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1139426|Migrate MediaWiki.wikibase.* stats (T359251 T359252)]], [[gerrit:1139427|Migrate MediaWiki.$prefix.wikibase.client.scribunto.* to statslib (T359253)]] (duration: 13m 52s) [13:33:52] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply [13:33:54] T359251: [REPO][SW][GRAFMIGR] (mw.track) Migrate MediaWiki.wikibase.repo.* to statslib - https://phabricator.wikimedia.org/T359251 [13:33:54] T359252: [GRAFMIGR] Migrate MediaWiki.wikibase.view.* to statslib - https://phabricator.wikimedia.org/T359252 [13:33:55] T359253: [CLIENT][SW][GRAFMIGR] Migrate MediaWiki.$prefix.wikibase.client.scribunto.* to statslib - https://phabricator.wikimedia.org/T359253 [13:33:58] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [13:34:07] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [13:34:34] (03Merged) 10jenkins-bot: Fix: PHP Warning: Undefined array key "request" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1138443 (https://phabricator.wikimedia.org/T392026) (owner: 10Jforrester) [13:34:56] !log kamila@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/mw-cron: apply [13:35:10] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.70 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:35:58] PROBLEM - Hadoop NodeManager on an-worker1168 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:36:00] PROBLEM - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:36:07] sorry, I got distracted for a second [13:36:12] James_F: you’re good to go [13:36:16] unless you want me to do the deploy [13:36:17] Ack. [13:36:20] I'll do it. [13:36:22] ok [13:36:37] PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:37:12] !log sudo cumin 'A:durum' 'disable-puppet "rolling out CR 1138823"' [13:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:20] tgr_: Did you want me to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1136132 whilst I'm at it? [13:37:43] Oh, wait, https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1135060 isn't in prod at all yet, I presume this should wait? 
[13:38:10] (03CR) 10Jforrester: "https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1135060 only just landed last week; do this need to wait until that is everywhere (wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136132 (https://phabricator.wikimedia.org/T142313) (owner: 10Gergő Tisza) [13:38:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139038 (owner: 10Jforrester) [13:38:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139039 (https://phabricator.wikimedia.org/T376827) (owner: 10Jforrester) [13:38:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139047 (https://phabricator.wikimedia.org/T390384) (owner: 10Jforrester) [13:39:30] (03Merged) 10jenkins-bot: Move wgParserMigrationEnableParsoidDiscussionTools and wgParserMigrationEnableParsoidArticlePages to a dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139038 (owner: 10Jforrester) [13:39:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet [13:40:10] FIRING: [10x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:40:11] (03Merged) 10jenkins-bot: manage-dblist: Default all new wikis to parsoidrendered [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139039 (https://phabricator.wikimedia.org/T376827) (owner: 10Jforrester) [13:40:15] (03Merged) 10jenkins-bot: nupwiki: Enable Parsoid mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139047 (https://phabricator.wikimedia.org/T390384) (owner: 10Jforrester) [13:40:52] !log jforrester@deploy1003 Started scap sync-world: Backport for 
[[gerrit:1138443|Fix: PHP Warning: Undefined array key "request" (T392026)]], [[gerrit:1139038|Move wgParserMigrationEnableParsoidDiscussionTools and wgParserMigrationEnableParsoidArticlePages to a dblist]], [[gerrit:1139039|manage-dblist: Default all new wikis to parsoidrendered (T376827)]], [[gerrit:1139047|nupwiki: Enable Parsoid mode (T390384)]] [13:40:59] T392026: PHP Warning: Undefined array key "request" - https://phabricator.wikimedia.org/T392026 [13:40:59] T376827: Add a new checklist item to the Wiki creation process for Parsoid Read Views - https://phabricator.wikimedia.org/T376827 [13:41:00] T390384: Create Wikipedia Nupe - https://phabricator.wikimedia.org/T390384 [13:41:55] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs7003.magru.wmnet} and A:liberica [13:42:17] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs7003.magru.wmnet} and A:liberica [13:43:03] (03CR) 10Ssingh: [V:03+1 C:03+2] P:durum: add conditional to enable ECH (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1138823 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [13:43:08] RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:43:08] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:43:08] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:43:08] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:43:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [13:43:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [13:43:48] RECOVERY - BFD status on 
cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:43:48] RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:43:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet [13:44:18] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:44:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet [13:44:40] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2092 to cirrussearch2092 [13:45:03] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [13:45:04] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:45:10] RESOLVED: [10x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:45:10] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [13:45:24] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1138443|Fix: PHP Warning: Undefined array key "request" (T392026)]], [[gerrit:1139038|Move wgParserMigrationEnableParsoidDiscussionTools and wgParserMigrationEnableParsoidArticlePages to a dblist]], [[gerrit:1139039|manage-dblist: Default all new wikis to parsoidrendered (T376827)]], [[gerrit:1139047|nupwiki: Enable Parsoid mode (T390384)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:46:10] !log jforrester@deploy1003 jforrester: Continuing with sync [13:47:43] James_F: uh yeah, I didn't think that one through [13:47:47] (03PS1) 10DDesouza: Design Research Participant Survey: Increase Coverage [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1139474 (https://phabricator.wikimedia.org/T392325) [13:47:48] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:47:48] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:47:49] I'll move it to next week [13:47:56] tgr_: No worries, I didn't merge it anyway. :-) [13:48:00] <3 [13:48:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136132 (https://phabricator.wikimedia.org/T142313) (owner: 10Gergő Tisza) [13:48:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet [13:48:48] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns1004.wikimedia.org [13:48:48] RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:48:48] RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:48:48] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns1004.wikimedia.org [13:49:14] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1004.wikimedia.org [13:49:26] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns1004.wikimedia.org [reason: reboot finished] [13:49:34] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:49:44] PROBLEM - Bird Internet Routing Daemon on durum3003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird 
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:49:46] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:49:46] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2092 to cirrussearch2092 - bking@cumin2002" [13:49:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [13:50:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2002.codfw.wmnet [13:50:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet [13:50:14] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum3003 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:50:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [13:51:30] !log sukhe@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3003.esams.wmnet with reason: testing ECH [13:51:36] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2092 to cirrussearch2092 - bking@cumin2002" [13:51:36] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:51:37] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2092 [13:51:59] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: move remaining pagetriage jobs to k8s [puppet] - 
10https://gerrit.wikimedia.org/r/1139424 (https://phabricator.wikimedia.org/T388536) (owner: 10Hnowlan) [13:52:49] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1138443|Fix: PHP Warning: Undefined array key "request" (T392026)]], [[gerrit:1139038|Move wgParserMigrationEnableParsoidDiscussionTools and wgParserMigrationEnableParsoidArticlePages to a dblist]], [[gerrit:1139039|manage-dblist: Default all new wikis to parsoidrendered (T376827)]], [[gerrit:1139047|nupwiki: Enable Parsoid mode (T390384)]] (duration: 11m 56s) [13:52:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139474 (https://phabricator.wikimedia.org/T392325) (owner: 10DDesouza) [13:52:55] T392026: PHP Warning: Undefined array key "request" - https://phabricator.wikimedia.org/T392026 [13:52:55] T376827: Add a new checklist item to the Wiki creation process for Parsoid Read Views - https://phabricator.wikimedia.org/T376827 [13:52:55] T390384: Create Wikipedia Nupe - https://phabricator.wikimedia.org/T390384 [13:53:14] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2092 [13:53:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:53:54] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2092 to cirrussearch2092 [13:54:24] !log Deployment window complete. 
[13:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:44] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2092.codfw.wmnet with OS bullseye [13:54:55] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2092 [13:56:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [13:56:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet [13:57:00] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:57:22] (03PS3) 10Jforrester: dblists: Add sul.dbexpr and generated sul.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) (owner: 10BryanDavis) [13:57:23] (03PS4) 10Jforrester: Use `sul` dblist in InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [13:57:49] (03CR) 10Jforrester: [C:03+1] "PS3: Rebase and re-gen to add nupwiki." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) (owner: 10BryanDavis) [13:58:03] !log sukhe@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3004.esams.wmnet with reason: testing ECH [13:58:16] (03CR) 10CI reject: [V:04-1] Use `sul` dblist in InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [13:58:23] (03CR) 10Jforrester: [C:03+1] "PS4: Rebase over my addition of the new parsoidrendered dblist." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [13:58:39] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [13:58:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:58:46] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [13:59:06] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:59:32] (03PS1) 10Lucas Werkmeister (WMDE): manage-dblist: Add missing spaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139478 [13:59:34] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:59:39] jouncebot: nowandnext [13:59:40] For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1300) [13:59:40] In 1 hour(s) and 30 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1530) [13:59:43] (03CR) 10Hnowlan: [C:03+2] mw::maintenance::campaignevents: migrate remaining updateutcts jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139428 (https://phabricator.wikimedia.org/T385867) (owner: 10Hnowlan) [14:00:37] I’ll quickly roll out that code style cleanup [14:01:07] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2092 - bking@cumin2002" [14:01:13] !log bking@cumin2002 END (PASS) - 
Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2092 - bking@cumin2002" [14:01:13] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:01:14] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2092.codfw.wmnet 228.16.192.10.in-addr.arpa 8.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:01:15] (03PS5) 10Jforrester: Use `sul` dblist in InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [14:01:17] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2092.codfw.wmnet 228.16.192.10.in-addr.arpa 8.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:01:18] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2092 [14:01:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139478 (owner: 10Lucas Werkmeister (WMDE)) [14:01:58] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:02:12] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2092 [14:02:12] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2092 [14:02:41] (03Merged) 10jenkins-bot: manage-dblist: Add missing spaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139478 (owner: 10Lucas Werkmeister (WMDE)) [14:02:57] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1139478|manage-dblist: Add missing spaces]] [14:02:58] (03PS1) 10Filippo Giunchedi: puppetdb: 
add tunable for maximum-pool-size [puppet] - 10https://gerrit.wikimedia.org/r/1139481 [14:03:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:04:14] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org [14:06:10] FIRING: [4x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:06:28] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:06:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:06:58] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:07:14] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum3003 is OK: OK: UP (pid=1383598) and all threads (8) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [14:07:18] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1139478|manage-dblist: Add missing spaces]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:07:24] I’ll do a very cursory test [14:07:48] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:07:48] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:08:10] seems to work afaict [14:08:15] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync [14:08:46] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:08:50] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:08:50] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:09:34] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:09:34] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:09:44] RECOVERY - Bird Internet Routing Daemon on durum3003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:09:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:09:46] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:09:46] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 95, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:09:46] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:09:49] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:09:50] RECOVERY - BFD status on cr3-ulsfo is 
OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:09:50] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:10:08] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:10:48] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:10:48] RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:11:10] FIRING: [10x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:11:33] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10772794 (10ArthurPSmith) @Silvan_WMDE Thanks for working on this! I would note that t... 
[14:12:00] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [14:13:44] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org [14:13:56] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:13:56] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:14:56] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:14:56] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:15:09] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1139478|manage-dblist: Add missing spaces]] (duration: 12m 12s) [14:16:10] FIRING: [12x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:16:28] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on people2003.codfw.wmnet with reason: T391357 [14:16:32] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1051.eqiad.wmnet, wikikube-worker1144.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, 
wikikube-worker1042.eqiad.wmnet, wikikube-worker1280.eqiad.wmnet, wikikube-worker1298.eqiad.wmnet, wikikube-worker1306.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1155.eqiad. [14:16:32] ikikube-worker1103.eqiad.wmnet, wikikube-worker1108.eqiad.wmnet, wikikube-worker1007.eqiad.wmnet, wikikube-worker1315.eqiad.wmnet, wikikube-worker1094.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1072.eqiad.wmnet, wikikube-worker1307.eqiad.wmnet, wikikube-worker1313.eqiad.wmnet, wikikube-worker1056.eqiad.wmnet, wikikube-worker1270.eqiad.wmnet, wikikube-worke [14:16:32] iad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube-worker1119.eqiad.wmnet, wikikube-worker1070.eqiad.wmnet, wikikube-worker1309.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal [14:16:32] T391357: Deploy Phabricator/Phorge 2025-04-08 - https://phabricator.wikimedia.org/T391357 [14:16:40] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1280.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1108.eqiad.wmnet, wikikube-worker1121.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, wikikube-worker1007.eqiad.wmnet, wikikube-worker1049.eqiad. 
[14:16:40] ikikube-worker1094.eqiad.wmnet, wikikube-worker1016.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1157.eqiad.wmnet, wikikube-worker1053.eqiad.wmnet, wikikube-worker1072.eqiad.wmnet, wikikube-worker1149.eqiad.wmnet, wikikube-worker1159.eqiad.wmnet, wikikube-worker1287.eqiad.wmnet, wikikube-worker1168.eqiad.wmnet, wikikube-worker1278.eqiad.wmnet, wikikube-worke [14:16:40] iad.wmnet, wikikube-worker1102.eqiad.wmnet, wikikube-worker1002.eqiad.wmnet, wikikube-worker1021.eqiad.wmnet, wikikube-worker1130.eqiad.wmnet, wikikube-worker1062.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal [14:16:46] hello [14:16:57] FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:16:58] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:17:23] !incidents [14:17:24] 6058 (UNACKED) ProbeDown sre (10.2.2.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 eqiad) [14:17:24] 6056 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [14:17:26] !ack 6058 [14:17:27] 6058 (ACKED) ProbeDown sre (10.2.2.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 eqiad) [14:17:50] thank you sukhe [14:17:54] 07Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10772800 (10elukey) The host was reimaged on the 5th afaics: ` 2024-05-06 09:10:50,421 marostegui 595479 [DEBUG _cookbook.py:511 in main] Executing cookbook sre.hosts.reimage with args: ['--os', 'bookworm', '-t', 'T363... 
[14:18:00] godog: no worries, now on to finding out how to debug this :D [14:18:05] lol indeed [14:18:09] o/ [14:18:10] * Lucas_WMDE done deploying btw [14:18:13] I will have a look also [14:18:16] hi hnowlan :) [14:18:16] <3 [14:18:56] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:18:56] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:19:04] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2092.codfw.wmnet with reason: host reimage [14:19:04] thanks hnowlan <3 [14:19:14] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on planet2003.codfw.wmnet with reason: reboot [14:19:45] huh, every worker is busy [14:19:47] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on planet1003.eqiad.wmnet with reason: reboot [14:20:26] mhm [14:20:28] that should be highly unlikely, but we have 4 instances of mw-videoscaler running in parallel [14:20:41] temporary fix is to bump replicas and let this get cleaned up [14:20:56] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:20:56] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:21:10] FIRING: [14x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:21:20] hnowlan: I can do that [14:22:04] (03PS1) 10Hnowlan: shellbox-video: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139484 [14:22:14] Raine: oh sorry, was in the other tab doing ^ :D [14:22:14] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on cirrussearch2092.codfw.wmnet with reason: host reimage [14:22:18] (03CR) 10Ssingh: [C:03+1] shellbox-video: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139484 (owner: 10Hnowlan) [14:22:35] hnowlan: oh, okay, thanks :D [14:22:44] that comment above the value is looking a little silly now >_> [14:23:07] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:23:09] someone must have uploaded something big [14:23:37] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:23:42] FIRING: [7x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:24:03] (03CR) 10Hnowlan: [C:03+2] shellbox-video: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139484 (owner: 10Hnowlan) [14:24:35] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:25:09] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:25:55] (03Merged) 10jenkins-bot: shellbox-video: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139484 (owner: 10Hnowlan) [14:26:06] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [14:26:10] RESOLVED: [10x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:26:37] aghhh this is going to fail because it'll hit the resource limits [14:26:46] uh [14:27:18] (03CR) 10Jforrester: "Oh, sorry, thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139478 (owner: 10Lucas Werkmeister (WMDE)) [14:27:37] (maybe) [14:28:00] (03CR) 10Lucas Werkmeister (WMDE): "np, not your fault that phpcs isn’t running :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139478 (owner: 10Lucas Werkmeister (WMDE)) [14:28:07] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:28:35] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:28:35] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [14:28:41] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:28:44] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org [14:29:31] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:29:37] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:29:48] thanks hnowlan! 
[14:30:09] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:30:29] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:30:29] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:30:33] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10772825 (10taavi) Anything left to do here? [14:30:35] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10772827 (10elukey) p:05Triage→03Medium [14:30:55] sukhe: it might come back unfortunately, I'll keep looking [14:31:06] (03CR) 10CDanis: [C:03+1] Allow releng to resume train related systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) (owner: 10Hashar) [14:31:20] hnowlan: hth if on-callers can, please let us know [14:31:24] (03CR) 10Muehlenhoff: "Was approved in the weekly SRE IF meeting." [puppet] - 10https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) (owner: 10Hashar) [14:31:43] 06SRE, 06Infrastructure-Foundations: Use a forward port of Puppet 7 on Trixie hosts - https://phabricator.wikimedia.org/T392790#10772839 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:31:52] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10772840 (10Silvan_WMDE) >>! 
In T374230#10772794, @ArthurPSmith wrote: > Does the fix... [14:31:57] RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:32:01] nice [14:32:01] sukhe: thanks. The good news is that as it stands this isn't creating user-facing errors [14:32:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10772843 (10ayounsi) p:05Triage→03Medium [14:32:31] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1051.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, wikikube-worker1306.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1103.eqiad.wmnet, wikikube-worker1108.eqiad. 
[14:32:31] ikikube-worker1116.eqiad.wmnet, wikikube-worker1281.eqiad.wmnet, wikikube-worker1315.eqiad.wmnet, wikikube-worker1132.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1136.eqiad.wmnet, wikikube-worker1260.eqiad.wmnet, wikikube-worker1161.eqiad.wmnet, wikikube-worker1157.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1307.eqiad.wmnet, wikikube-worker1069.eqiad.wmnet, wikikube-worker1159.eqiad.wmnet, wikikube-worke [14:32:31] iad.wmnet, wikikube-worker1168.eqiad.wmnet, wikikube-worker1244.eqiad.wmnet, wikikube-worker1148.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal [14:32:36] hmm [14:32:41] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1280.eqiad.wmnet, wikikube-worker1144.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, wikikube-worker1067.eqiad.wmnet, wikikube-worker1148.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1103.eqiad.wmnet, wikikube-worker1108.eqiad. 
[14:32:41] ikikube-worker1101.eqiad.wmnet, wikikube-worker1121.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker1094.eqiad.wmnet, wikikube-worker1132.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1136.eqiad.wmnet, wikikube-worker1260.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1053.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, wikikube-worke [14:32:41] iad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube-worker1306.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube-worker1070.eqiad.wmnet, wikikube-worker1256.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal [14:32:45] sigh [14:32:47] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:32:47] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:32:49] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:32:57] FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:33:01] !ack 6059 [14:33:02] 6059 (ACKED) ProbeDown sre (10.2.2.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 eqiad) [14:33:18] hnowlan: I guess time to track down what's causing this? 
[14:33:24] the 4 scap runs in series created 4 workers [14:33:30] all of which are doing long-running transcodes [14:33:33] 07Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10772850 (10jhathaway) Thanks @elukey, perhaps puppetserver needs to be reloaded to pick up the revoke, and this didn't happen until more recently? [14:33:49] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:33:59] unfortunately I think the best way to stem the bleeding is to kill one which will cause a small number of transcodes to fail, but otherwise we can't be sure about how long this will go on [14:34:14] 06SRE, 06Traffic: Console domain and property access request - https://phabricator.wikimedia.org/T381904#10772852 (10joanna_borun) [14:34:25] hnowlan: those transcodes might even get retried automatically, no? [14:34:41] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:34:45] maybe, hopefully :) [14:35:01] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on planet2003.codfw.wmnet with reason: T391357 [14:35:05] go for it then [14:35:05] T391357: Deploy Phabricator/Phorge 2025-04-08 - https://phabricator.wikimedia.org/T391357 [14:35:19] recoveries coming in again. what do we usually do when a long running transcode is in progress like this? 
[14:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:35:28] ah nvm, I see the message at :33 [14:35:31] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:35:44] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [14:35:49] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 10Sustainability (Incident Followup): sessionstorage namespacing - https://phabricator.wikimedia.org/T392170#10772856 (10Krinkle) [14:35:49] deleted a pod [14:36:16] !incidents [14:36:17] 6059 (ACKED) ProbeDown sre (10.2.2.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 eqiad) [14:36:17] 6058 (RESOLVED) ProbeDown sre (10.2.2.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 eqiad) [14:36:17] 6056 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [14:36:18] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 10Sustainability (Incident Followup): sessionstore workload observability - https://phabricator.wikimedia.org/T392182#10772857 (10Krinkle) [14:36:25] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.77 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:36:35] thank you hnowlan <3 [14:36:40] indeed <3 [14:36:45] FIRING: 
CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [14:36:47] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:36:47] RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:37:41] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:37:49] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:37:57] RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:38:09] PROBLEM - ganeti-wconfd running on ganeti-test2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:38:31] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1280.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1306.eqiad.wmnet, wikikube-worker1103.eqiad.wmnet, wikikube-worker1108.eqiad.wmnet, wikikube-worker1121.eqiad.wmnet, wikikube-worker1116.eqiad. 
[14:38:31] ikikube-worker1036.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker1320.eqiad.wmnet, wikikube-worker1260.eqiad.wmnet, wikikube-worker1268.eqiad.wmnet, wikikube-worker1094.eqiad.wmnet, wikikube-worker1132.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1157.eqiad.wmnet, wikikube-worker1053.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, wikikube-worke [14:38:31] iad.wmnet, wikikube-worker1159.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, wikikube-worker1244.eqiad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube-worker1148.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal [14:38:41] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1051.eqiad.wmnet, wikikube-worker1144.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1280.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1050.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1049.eqiad.wmnet, wikikube-worker1025.eqiad. 
[14:38:41] ikikube-worker1315.eqiad.wmnet, wikikube-worker1016.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1136.eqiad.wmnet, wikikube-worker1157.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1072.eqiad.wmnet, wikikube-worker1149.eqiad.wmnet, wikikube-worker1313.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, wikikube-worker1244.eqiad.wmnet, wikikube-worker1037.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube-worke [14:38:41] iad.wmnet, wikikube-worker1070.eqiad.wmnet, wikikube-worker1162.eqiad.wmnet, wikikube-worker1098.eqiad.wmnet, wikikube-worker1021.eqiad.wmnet, wikikube-worker1062.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal [14:38:41] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:38:49] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:38:52] Monday is turning out to be fun :) [14:39:22] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough [14:39:43] sorry about this, pods are restarting themselves (when they shouldn't be? 
unclear) [14:40:03] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on releases2003.codfw.wmnet with reason: T391357 [14:40:07] T391357: Deploy Phabricator/Phorge 2025-04-08 - https://phabricator.wikimedia.org/T391357 [14:40:31] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:40:41] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:41:19] okay, terminated all old videoscalers [14:41:25] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.77 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:41:38] -video pods will take a little bit to clear up though as the jobs have to finish, unfortunately [14:41:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [14:41:51] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:41:53] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:42:42] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on stewards2001.codfw.wmnet with reason: T391357 [14:42:44] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org [14:42:51] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:42:51] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:43:07] ack, thanks for the update hnowlan [14:43:14] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2092.codfw.wmnet with OS bullseye [14:43:52] (03PS2) 10Ssingh: gerrit: add google-site-verification [dns] - 10https://gerrit.wikimedia.org/r/1138996 (https://phabricator.wikimedia.org/T392669) (owner: 10Hashar) [14:44:55] because all replicas are busy in -video, the odds of the apply I did 10 minutes ago failing are quite high. 
if that happens, we might see another page, and if we do I will just manually bump replicas [14:45:10] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on vrts2002.codfw.wmnet with reason: T391357 [14:45:14] T391357: Deploy Phabricator/Phorge 2025-04-08 - https://phabricator.wikimedia.org/T391357 [14:45:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [14:45:44] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [14:45:48] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10772886 (10ssingh) >>! In T379927#10772825, @taavi wrote: > Anything left to do here? Nothing on the prod DNS hosts side; if you k... 
[14:46:10] (03CR) 10Ssingh: [C:03+2] gerrit: add google-site-verification [dns] - 10https://gerrit.wikimedia.org/r/1138996 (https://phabricator.wikimedia.org/T392669) (owner: 10Hashar) [14:46:11] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [14:46:18] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10772887 (10MoritzMuehlenhoff) [14:46:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138859 (https://phabricator.wikimedia.org/T388719) (owner: 10Bernard Wang) [14:46:22] hashar: ^ [14:46:29] deploying https://gerrit.wikimedia.org/r/c/operations/dns/+/1138996 [14:46:33] !log sukhe@dns1004 START - running authdns-update [14:47:43] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [14:47:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [14:47:55] !log reprepro -C component/nginx-ech include bookworm-wikimedia nginx_1.22.1-9+deb12u1+ech4_amd64.changes: T205378 [14:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:00] T205378: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 [14:48:23] PROBLEM - Hadoop NodeManager on an-worker1189 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:49:01] thanks some more hnowlan <3 [14:49:02] !log sukhe@dns1004 END - running authdns-update [14:50:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2004.codfw.wmnet [14:51:23] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly 
on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:51:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2004.codfw.wmnet [14:54:57] (03PS1) 10Lucas Werkmeister (WMDE): manage-dblist: Fix indentation and stray blank line [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139487 (https://phabricator.wikimedia.org/T392819) [14:54:58] (03PS1) 10Lucas Werkmeister (WMDE): manage-dblist: Fix some random phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139488 (https://phabricator.wikimedia.org/T392819) [14:55:01] (03PS1) 10Lucas Werkmeister (WMDE): manage-dblist: Rename to manage-dblist.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139489 (https://phabricator.wikimedia.org/T392819) [14:55:37] !log re-enable puppet and force agent run on A:durum [14:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:53] PROBLEM - Disk space on mwmaint1002 is CRITICAL: DISK CRITICAL - free space: / 3753 MB (3% inode=92%): /tmp 3753 MB (3% inode=92%): /var/tmp 3753 MB (3% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops [14:57:09] (03CR) 10Joal: [C:03+1] "+1, it doesn't really change anything on our end :)" [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) (owner: 10Fabfur) [14:57:31] !log CREATE INDEX cxs_source_language_title ON cx_suggestions (cxs_source_language, cxs_title); on 
wikishared (T390510) [14:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:42] T390510: Fatal DBUnexpectedError: "Database servers in extension1 are overloaded" - https://phabricator.wikimedia.org/T390510 [14:57:44] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org [14:58:52] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti-test2001.codfw.wmnet [14:58:52] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet [14:59:04] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:01:50] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:01:50] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:04:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:04:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2005.codfw.wmnet [15:05:24] RECOVERY - Hadoop NodeManager on an-worker1189 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:05:50] RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:05:50] RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:48] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [15:08:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2005.codfw.wmnet [15:10:22] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [15:11:33] !log CREATE INDEX translation_started_by_last_updated_timestamp ON cx_translations (translation_started_by, translation_last_updated_timestamp); (T390510) [15:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:38] T390510: Fatal DBUnexpectedError: "Database servers in extension1 are overloaded" - https://phabricator.wikimedia.org/T390510 [15:13:08] (03CR) 10LorenMora: [C:03+1] Remove Search AB test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138859 (https://phabricator.wikimedia.org/T388719) (owner: 10Bernard Wang) [15:13:56] PROBLEM - Hadoop NodeManager on an-worker1194 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:03] 10ops-magru, 06DC-Ops, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): 
missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10773031 (10tappof) @wiki_willy, I was able to split the PDUs in a 'per row' manner. If you're looking at a PoP, this is equiva... [15:14:05] sukhe@cumin1002 roll-reboot (PID 3625946) is awaiting input [15:14:21] er [15:14:25] (03CR) 10JHathaway: [C:03+1] sre.hosts.move-vlan: improve grep reports when renaming [cookbooks] - 10https://gerrit.wikimedia.org/r/1139407 (https://phabricator.wikimedia.org/T392729) (owner: 10Elukey) [15:15:14] (03CR) 10JHathaway: [C:03+2] puppetserver: update sync-puppet-ca timer [puppet] - 10https://gerrit.wikimedia.org/r/1139120 (https://phabricator.wikimedia.org/T392628) (owner: 10JHathaway) [15:15:14] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns2004.wikimedia.org [reason: reboot finished] [15:15:25] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns2004.wikimedia.org [15:15:25] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns2004.wikimedia.org [15:15:54] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org [15:17:19] (03CR) 10Elukey: [C:03+2] sre.hosts.move-vlan: improve grep reports when renaming [cookbooks] - 10https://gerrit.wikimedia.org/r/1139407 (https://phabricator.wikimedia.org/T392729) (owner: 10Elukey) [15:17:56] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [15:19:56] RECOVERY - Hadoop NodeManager on an-worker1194 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:24:34] (03CR) 10JHathaway: "@mmuhlenhoff@wikimedia.org per our IRL discussion the other piece of timer validation is here, https://gerrit.wikimedia.org/r/plugins/giti" [puppet] - 10https://gerrit.wikimedia.org/r/1138905 
(https://phabricator.wikimedia.org/T392629) (owner: 10JHathaway) [15:25:44] (03PS1) 10Ayounsi: Fastnetmon: bump threshold_pps to 1.75M [puppet] - 10https://gerrit.wikimedia.org/r/1139503 [15:25:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:25:57] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2093 to cirrussearch2093 [15:26:19] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:26:42] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10773102 (10MoritzMuehlenhoff) [15:28:06] RECOVERY - Check unit status of backup-kdc-database on krb1002 is OK: OK: Status of the systemd unit backup-kdc-database https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:28:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:30:05] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1530). 
[15:30:25] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2093 to cirrussearch2093 - bking@cumin2002" [15:30:43] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2093 to cirrussearch2093 - bking@cumin2002" [15:30:43] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:30:44] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2093 [15:30:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:30:54] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2005.wikimedia.org [15:31:03] !log lucaswerkmeister-wmde@stat1011:~$ sudo -u analytics-wmde git -C /srv/analytics-wmde/graphite/src/scripts/ pull # T389344, I don’t want to wait until the next Puppet run in 26 minutes [15:31:03] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2093 [15:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:07] T389344: analytics/wmde/scripts Graphite to Prometheus migration - https://phabricator.wikimedia.org/T389344 [15:31:20] 07Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10773131 (10elukey) One thing that I see is that the reimage failed: ` 2024-05-06 10:31:27,368 marostegui 595479 [INFO _log.py:125 in log_task_end] END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1178... 
[15:31:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2093 to cirrussearch2093 [15:31:46] RECOVERY - Disk space on arclamp1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp1001&var-datasource=eqiad+prometheus/ops [15:32:19] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2093.codfw.wmnet on all recursors [15:32:22] !log bking@cumin2002 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) cirrussearch2093.codfw.wmnet on all recursors [15:32:40] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2093.codfw.wmnet with OS bullseye [15:33:11] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2093 [15:34:06] RECOVERY - Disk space on arclamp2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp2001&var-datasource=codfw+prometheus/ops [15:34:34] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:34:50] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:34:50] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:36:26] I'd like to use `deleteBatch.php` to delete a set of broken Flow boards on gomwiki... any issues with doing so? 
[15:36:50] RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:36:50] RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:37:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:41] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2093 - bking@cumin2002" [15:38:47] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2093 - bking@cumin2002" [15:38:48] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:38:48] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2093.codfw.wmnet 229.16.192.10.in-addr.arpa 9.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:38:51] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2093.codfw.wmnet 229.16.192.10.in-addr.arpa 9.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:38:52] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2093 [15:38:53] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:39:02] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2093 [15:39:02] !log bking@cumin2002 END (PASS) - 
Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2093 [15:39:26] !log installing edk2 security updates [15:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:42:45] right then, going ahead [15:42:53] !log zoe@deploy1003 manually-logged T389247 Beginning deletion of broken gomwiki flow boards [15:42:57] T389247: Run Flow migration script at *gomwiki* - https://phabricator.wikimedia.org/T389247 [15:43:39] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) (owner: BryanDavis) [15:43:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:45:29] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns2005.wikimedia.org [15:45:36] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [15:45:48] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.dns.roll-reboot (exit_code=97) rolling reboot on A:dnsbox [15:46:12] !log pause execution of sre.dns.roll-reboot to figure out skipping of Icinga service warning [15:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:24] !log sukhe@cumin1002 
START - Cookbook sre.hosts.remove-downtime for dns2005.wikimedia.org [15:46:25] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns2005.wikimedia.org [15:46:33] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns2005.wikimedia.org [reason: reboot finished] [15:47:23] !log lucaswerkmeister-wmde@stat1011:~$ sudo -u analytics-wmde git -C /srv/analytics-wmde/graphite/src/scripts/ pull --ff-only # T389344 [15:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:28] T389344: analytics/wmde/scripts Graphite to Prometheus migration - https://phabricator.wikimedia.org/T389344 [15:48:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [15:48:49] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:49:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7003.magru.wmnet [15:49:36] (CR) Nik Gkountas: Catalog ContentTranslation tables (9 comments) [puppet] - https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: Nik Gkountas) [15:49:44] (PS2) Nik Gkountas: Catalog ContentTranslation tables [puppet] - https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) [15:51:22] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit 
httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:51:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:51:41] ops-eqiad, SRE, collaboration-services, DC-Ops: eno1 on gitlab-runner1003:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T392585#10773259 (Jelto) a:Jelto [15:51:48] (CR) CI reject: [V:-1] Catalog ContentTranslation tables [puppet] - https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: Nik Gkountas) [15:52:11] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [15:52:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7003.magru.wmnet [15:53:42] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:46] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:55:20] !log zoe@deploy1003 manually-logged T389247 Completed deletion of broken gomwiki flow boards [15:55:24] T389247: Run Flow migration script at *gomwiki* - https://phabricator.wikimedia.org/T389247 [15:56:19] (CR) Fabfur: [C:+2] cache: use fqdn in haproxykafka hostname [puppet] - https://gerrit.wikimedia.org/r/1136679 
(https://phabricator.wikimedia.org/T382571) (owner: Fabfur) [15:56:26] bking@cumin2002 reimage (PID 1020203) is awaiting input [15:57:13] ops-eqiad, SRE, collaboration-services, DC-Ops: eno1 on gitlab-runner1003:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T392585#10773280 (LSobanski) p:Triage→Medium [15:57:16] (PS3) Hnowlan: trafficserver: route all but zhwiki PCS routes via the rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1138692 (https://phabricator.wikimedia.org/T390724) [15:57:43] !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [15:58:25] SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834 (Urbanecm_WMF) NEW [15:59:01] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [15:59:19] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:59:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T392806)', diff saved to https://phabricator.wikimedia.org/P75520 and previous config saved to /var/cache/conftool/dbconfig/20250428-155924-fceratto.json [15:59:36] SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773310 (Urbanecm_WMF) Feels like something's filling things up. I removed some files I no longer need in my home, which got it at 99% and 1.9G space available. At this point, less than 900M is available (so about a GB worth o... 
[15:59:36] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2093.codfw.wmnet with OS bullseye [15:59:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7003.magru.wmnet [16:00:01] SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773313 (Urbanecm_WMF) p:Triage→Unbreak! Provisionally, server fully out of space doesn't seem like a good idea. Feel free to lower if you think that's appropriate. [16:00:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7003.magru.wmnet [16:03:34] !log zoe@deploy1003 manually-logged T389247 attempting migration [16:03:38] T389247: Run Flow migration script at *gomwiki* - https://phabricator.wikimedia.org/T389247 [16:03:42] FIRING: [7x] ProbeDown: Service ganeti7003:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:04:49] ops-magru, DC-Ops, Observability-Metrics, Patch-For-Review, SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10773340 (wiki_willy) Thanks @tappof, that looks perfect. Thanks for splitting it up by rack! I went through and checked th... [16:07:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T392806)', diff saved to https://phabricator.wikimedia.org/P75523 and previous config saved to /var/cache/conftool/dbconfig/20250428-160734-fceratto.json [16:07:36] SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773343 (Urbanecm_WMF) And we're at zero availability: ` [urbanecm@mwmaint1002 ~]$ df -h Filesystem Size Used Avail Use% Mounted on [...] /dev/mapper/mwmaint1002--vg-root 121G 116G 0 100% / [...]... 
[16:09:04] (PS1) Ssingh: P:auth: temporarily skip returning a WARN on check_authdns_state [puppet] - https://gerrit.wikimedia.org/r/1139510 [16:10:13] (CR) Ebernhardson: [C:-1] "The .deb to be built is at https://gitlab.wikimedia.org/repos/search-platform/opensearch-madvise/" [puppet] - https://gerrit.wikimedia.org/r/1132692 (https://phabricator.wikimedia.org/T390592) (owner: Ebernhardson) [16:11:04] SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773350 (elukey) ` elukey@mwmaint1002:/home$ sudo du -hs /home/* | sort -h | tail 553M /home/ebernhardson 842M /home/catrope 1.2G /home/brion 1.3G /home/tstarling 1.7G /home/oblivian 1.7G /home/samtar 2.1G /home/cparle 11G /ho... [16:11:18] (CR) Ssingh: [C:+2] P:auth: temporarily skip returning a WARN on check_authdns_state [puppet] - https://gerrit.wikimedia.org/r/1139510 (owner: Ssingh) [16:11:37] (CR) Ssingh: [C:+2] "self-merging since this is a trivial Icinga check change and will be reverted." [puppet] - https://gerrit.wikimedia.org/r/1139510 (owner: Ssingh) [16:11:44] SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773353 (dancy) Big directories are: `/var/log`: 42GB and ` 22.8 GiB [##########] /home/zabe 14.9 GiB [###### ] /home/ladsgroup 10.9 GiB [#### ] /home/legoktm ` [16:12:30] SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773355 (elukey) And also: ` elukey@mwmaint1002:/var/log/mediawiki$ sudo du -hs * | sort -h | tail 505M mediawiki_job_mediamoderation-hourlyScan 519M mediawiki_job_purge_checkuser 546M mediawiki_job_cirrus_build_completion_in... [16:13:56] hey; please could i get a second opinion on / please could someone check if they can reproduce T392832 on their device? 
i'm increasingly feeling like it might be severe enough to be a train blocker for the upcoming train, and if it is i want to get it flagged to the right people sooner rather than later :) [16:13:56] T392832: Unable to access the revision-deletion interface from Special:Log - an "Invalid target revision" error page is displayed - https://phabricator.wikimedia.org/T392832 [16:14:01] (asking in -operations rather than in #mediawiki or anywhere else because of my worry that this might be a train blocker) [16:14:04] !log force agent run on A:dnsbox to merge CR 1139510 [16:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:32] (Abandoned) Hashar: [WIP] Stub LimeSurvey configuration [puppet] - https://gerrit.wikimedia.org/r/213579 (https://phabricator.wikimedia.org/T94807) (owner: Nemo bis) [16:15:36] SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773366 (Ladsgroup) I think something is broken with log rotation. When I was checking logs for systemd timer logs, I found stuff from years ago. [16:16:19] A_smart_kitten: yeah. I can repro on deployment-prep [16:16:25] RESOLVED: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:03] taavi: thanks for the check! [16:17:52] SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773379 (Ladsgroup) I deleted my old backup logs. That saves up 14GB but logs needs to be cleaned up. 
[16:20:19] (PS1) Majavah: P:wmcs::metricsinfra: Add instance FQDN template [puppet] - https://gerrit.wikimedia.org/r/1139511 (https://phabricator.wikimedia.org/T392570) [16:20:43] (CR) CI reject: [V:-1] P:wmcs::metricsinfra: Add instance FQDN template [puppet] - https://gerrit.wikimedia.org/r/1139511 (https://phabricator.wikimedia.org/T392570) (owner: Majavah) [16:21:04] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5370/co" [puppet] - https://gerrit.wikimedia.org/r/1139511 (https://phabricator.wikimedia.org/T392570) (owner: Majavah) [16:21:09] (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1139455 (https://phabricator.wikimedia.org/T392570) (owner: Majavah) [16:21:24] (CR) David Caro: [C:+1] P:toolforge: prometheus: Use DNS names to look up scrape targets (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1139455 (https://phabricator.wikimedia.org/T392570) (owner: Majavah) [16:21:33] (CR) David Caro: [C:+1] P:toolforge: prometheus: Use DNS names to look up scrape targets (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1139455 (https://phabricator.wikimedia.org/T392570) (owner: Majavah) [16:21:54] (PS2) Majavah: P:wmcs::metricsinfra: Add instance FQDN template [puppet] - https://gerrit.wikimedia.org/r/1139511 (https://phabricator.wikimedia.org/T392570) [16:22:13] (CR) Majavah: [C:+2] P:toolforge: prometheus: Remove duplication in relabel configs [puppet] - https://gerrit.wikimedia.org/r/1139454 (owner: Majavah) [16:22:20] (CR) Majavah: [V:+1 C:+2] P:toolforge: prometheus: Use DNS names to look up scrape targets [puppet] - https://gerrit.wikimedia.org/r/1139455 (https://phabricator.wikimedia.org/T392570) (owner: Majavah) [16:22:25] (Abandoned) Hashar: Tools: Puppetize gridengine global configuration [puppet] - 
https://gerrit.wikimedia.org/r/230477 (https://phabricator.wikimedia.org/T95747) (owner: Tim Landscheidt) [16:22:28] (Abandoned) Hashar: sge: Fix global config handling [puppet] - https://gerrit.wikimedia.org/r/351379 (https://phabricator.wikimedia.org/T162955) (owner: Madhuvishy) [16:22:34] (Abandoned) Hashar: gridengine: Cleanup mergeconf script and references [puppet] - https://gerrit.wikimedia.org/r/352281 (https://phabricator.wikimedia.org/T162955) (owner: Madhuvishy) [16:22:38] (Abandoned) Hashar: gridengine: Cleanup old scripts, tracker and collector [puppet] - https://gerrit.wikimedia.org/r/352294 (https://phabricator.wikimedia.org/T162955) (owner: Madhuvishy) [16:22:41] (Abandoned) Hashar: gridengine: Follow up - delete old maintenance scripts and tracker/collector puppet code [puppet] - https://gerrit.wikimedia.org/r/352301 (https://phabricator.wikimedia.org/T162955) (owner: Madhuvishy) [16:22:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P75525 and previous config saved to /var/cache/conftool/dbconfig/20250428-162242-fceratto.json [16:22:44] (PS2) Majavah: P:toolforge: prometheus: Use DNS names to look up scrape targets [puppet] - https://gerrit.wikimedia.org/r/1139455 (https://phabricator.wikimedia.org/T392570) [16:22:45] (Abandoned) Hashar: sge: Revamp queue,rqs configuration puppet [puppet] - https://gerrit.wikimedia.org/r/352895 (owner: Madhuvishy) [16:24:54] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs7003.magru.wmnet [16:25:10] (CR) Majavah: [V:+2 C:+2] P:toolforge: prometheus: Use DNS names to look up scrape targets [puppet] - https://gerrit.wikimedia.org/r/1139455 (https://phabricator.wikimedia.org/T392570) (owner: Majavah) [16:25:54] (PS3) Majavah: P:wmcs::metricsinfra: Add instance FQDN template [puppet] - https://gerrit.wikimedia.org/r/1139511 
(https://phabricator.wikimedia.org/T392570) [16:26:32] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and not (A:eqiad or A:codfw) and A:dnsbox [16:26:33] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org [16:27:31] SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773450 (Tgr) The GrowthExperiments logs seem properly rotated, there are daily logfiles going back two weeks, and the log entry dates match the file date. It just seems to be creating a huge amount of logs. [16:27:54] (PS3) Nik Gkountas: Catalog ContentTranslation tables [puppet] - https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) [16:28:13] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs7003.magru.wmnet [16:28:18] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:29:48] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin pooling P{lvs7003*} and A:liberica [16:30:19] (PS1) Gergő Tisza: mediawiki: Make refreshLinkRecommendations job less verbose [puppet] - https://gerrit.wikimedia.org/r/1139515 (https://phabricator.wikimedia.org/T392834) [16:30:32] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.admin (exit_code=1) pooling P{lvs7003*} and A:liberica [16:30:38] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:30:46] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:33:16] SRE, Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773473 (Urbanecm_WMF) Hmm... 
I just discovered mwmaint2002's disk is significantly larger than 1002's (430G vs 120G). Should we even have servers with the same role with very different diskspace? [16:33:19] SRE, Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773474 (Tgr) `listTaskCounts` uses `--output none` already, that 3G is entirely job runner boilerplate (a ton of rows like `Apr 18 15:11:00 mwmaint1002 mediawiki_job_growthexperiments-listTaskCounts[9828... [16:34:42] FIRING: JobUnavailable: Reduced availability for job pdnsrec in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:11] (PS1) Zoe: Set flow boards readonly on fiwikimedia and ptwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1139517 (https://phabricator.wikimedia.org/T380909) [16:35:40] (PS2) Zoe: Set flow boards readonly on fiwikimedia, gomwiki and ptwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1139517 (https://phabricator.wikimedia.org/T380909) [16:35:52] RECOVERY - Disk space on mwmaint1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops [16:36:33] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139517 (https://phabricator.wikimedia.org/T380909) (owner: Zoe) [16:37:09] (PS1) DCausse: cirrus: re-enable completion index rebuild in codfw [puppet] - https://gerrit.wikimedia.org/r/1139518 [16:37:27] (CR) BCornwall: "I forgot about this CR, sorry! 
I have since included this via I10c6d5e169972d44569b801d532d4759a6fd3e73" [puppet] - https://gerrit.wikimedia.org/r/1127551 (https://phabricator.wikimedia.org/T388809) (owner: Reedy) [16:37:35] (Abandoned) BCornwall: certificates.yaml: Add pywikipedia.org to non-canonical-redirect [puppet] - https://gerrit.wikimedia.org/r/1127551 (https://phabricator.wikimedia.org/T388809) (owner: Reedy) [16:37:40] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:37:46] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:37:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P75526 and previous config saved to /var/cache/conftool/dbconfig/20250428-163749-fceratto.json [16:39:42] RESOLVED: JobUnavailable: Reduced availability for job pdnsrec in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:40:00] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org [16:45:54] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:45:54] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:46:42] SRE, Infrastructure-Foundations, Puppet-Infrastructure: Exclude legacy facts by default - https://phabricator.wikimedia.org/T372666#10773538 (jhathaway) great thanks @fgiunchedi! 
[16:46:52] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:46:52] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:50:01] (CR) JHathaway: "looks good, just a doc request" [puppet] - https://gerrit.wikimedia.org/r/1139481 (owner: Filippo Giunchedi) [16:52:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T392806)', diff saved to https://phabricator.wikimedia.org/P75527 and previous config saved to /var/cache/conftool/dbconfig/20250428-165257-fceratto.json [16:53:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [16:53:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T392806)', diff saved to https://phabricator.wikimedia.org/P75528 and previous config saved to /var/cache/conftool/dbconfig/20250428-165323-fceratto.json [16:55:00] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns3004.wikimedia.org [16:58:46] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:59:08] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1700) [17:00:04] ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1700). 
[17:02:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T392806)', diff saved to https://phabricator.wikimedia.org/P75529 and previous config saved to /var/cache/conftool/dbconfig/20250428-170244-fceratto.json [17:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:04:46] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:05:10] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:11:32] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns3004.wikimedia.org [17:13:21] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on people1004.eqiad.wmnet with reason: reboot [17:15:12] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on stewards1001.eqiad.wmnet with reason: reboot [17:17:01] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on doc2003.codfw.wmnet with reason: reboot [17:17:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P75530 and previous config saved to /var/cache/conftool/dbconfig/20250428-171752-fceratto.json [17:18:50] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on etherpad1004.eqiad.wmnet with reason: reboot [17:23:42] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:24:04] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [17:24:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1179 (T392806)', diff saved to https://phabricator.wikimedia.org/P75531 and previous config saved to /var/cache/conftool/dbconfig/20250428-172410-ladsgroup.json [17:24:38] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2093-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:26:27] FIRING: SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on cirrussearch2093:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:26:32] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org [17:27:54] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:29:52] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:29:52] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:29:54] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status 
[17:30:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:30:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:31:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:32:10] FIRING: [2x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:32:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T392806)', diff saved to https://phabricator.wikimedia.org/P75532 and previous config saved to /var/cache/conftool/dbconfig/20250428-173250-ladsgroup.json [17:32:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P75533 and previous config saved to /var/cache/conftool/dbconfig/20250428-173259-fceratto.json [17:34:41] FIRING: [2x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:35:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:36:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:36:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:37:14] (CR) Dzahn: "This is correct but I would like to add that" 
[homer/public] - https://gerrit.wikimedia.org/r/1139406 (https://phabricator.wikimedia.org/T392793) (owner: Majavah) [17:37:50] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 95, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:37:52] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:37:54] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:39:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:42:10] RESOLVED: [2x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:43:04] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org [17:45:52] PROBLEM - Disk space on mwmaint1002 is CRITICAL: DISK CRITICAL - free space: / 3040 MB (2% inode=92%): /tmp 3040 MB (2% inode=92%): /var/tmp 3040 MB (2% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops [17:46:20] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2100 to cirrussearch2100 [17:46:44] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:47:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P75534 and previous config saved to /var/cache/conftool/dbconfig/20250428-174757-ladsgroup.json [17:48:06] !log fceratto@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db1172 (T392806)', diff saved to https://phabricator.wikimedia.org/P75535 and previous config saved to /var/cache/conftool/dbconfig/20250428-174806-fceratto.json [17:48:25] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [17:48:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T392806)', diff saved to https://phabricator.wikimedia.org/P75536 and previous config saved to /var/cache/conftool/dbconfig/20250428-174831-fceratto.json [17:48:47] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:49:36] (03PS6) 10BCornwall: varnish: Replace X-IS-ALT-DOMAIN with variable [puppet] - 10https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550) [17:52:23] bking@cumin2002 rename (PID 1154737) is awaiting input [17:54:46] 10SRE-swift-storage, 06Commons, 10Thumbor: Error: 429, Too Many Requests - https://phabricator.wikimedia.org/T392348#10773808 (10Yann) https://commons.wikimedia.org/wiki/File:Rembrandt_-_The_Abduction_of_Europa_-_Google_Art_Project.jpg thumbnails failed, but https://commons.wikimedia.org/wiki/File:Rembrandt_... 
[17:56:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T392806)', diff saved to https://phabricator.wikimedia.org/P75537 and previous config saved to /var/cache/conftool/dbconfig/20250428-175657-fceratto.json [17:58:04] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns4004.wikimedia.org [17:59:20] PROBLEM - OpenSearch health check for shards on 9400 on cirrussearch2093 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:59:54] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:01:52] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:01:54] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:01:54] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:03:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P75538 and previous config saved to /var/cache/conftool/dbconfig/20250428-180304-ladsgroup.json [18:03:39] (03PS1) 10Ssingh: P:durum: only log ECH status for ECH-enabled clients [puppet] - 10https://gerrit.wikimedia.org/r/1139525 (https://phabricator.wikimedia.org/T205378) [18:03:49] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:04:10] FIRING: [2x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:04:42] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5371/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139525 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:05:41] FIRING: JobUnavailable: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:05:51] (03PS2) 10Ssingh: P:durum: only log ECH status for ECH-enabled clients [puppet] - 10https://gerrit.wikimedia.org/r/1139525 (https://phabricator.wikimedia.org/T205378) [18:06:18] (03CR) 10AOkoth: miscweb: change os-reports runtime owner (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [18:06:55] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5372/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139525 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:07:06] (03PS3) 10Ssingh: P:durum: only log ECH status for ECH-enabled clients [puppet] - 10https://gerrit.wikimedia.org/r/1139525 (https://phabricator.wikimedia.org/T205378) [18:07:26] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2100 to cirrussearch2100 - bking@cumin2002" [18:07:43] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2100 to cirrussearch2100 - bking@cumin2002" [18:07:43] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:07:44] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2100 [18:07:54] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2100 [18:08:06] !log CREATE INDEX translation_last_update_by_last_updated_timestamp ON cx_translations (translation_last_update_by, translation_last_updated_timestamp); (T392839 and T390510) [18:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:12] T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839 [18:08:12] T390510: Fatal DBUnexpectedError: "Database servers in extension1 are overloaded" - https://phabricator.wikimedia.org/T390510 [18:08:14] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1139525 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:08:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2100 to cirrussearch2100 [18:08:52] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 95, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:08:54] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:08:54] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:09:16] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2100.codfw.wmnet on all recursors 
[18:09:20] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2100.codfw.wmnet on all recursors [18:09:43] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2100.codfw.wmnet with OS bullseye [18:09:56] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2100 [18:10:05] !log bking@cumin2002 START - Cookbook sre.dns.netbox [18:10:41] RESOLVED: JobUnavailable: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:11:47] (03PS1) 10Ssingh: Revert "P:auth: temporarily skip returning a WARN on check_authdns_state" [puppet] - 10https://gerrit.wikimedia.org/r/1139529 [18:11:56] !log CREATE INDEX cxl_owner ON cx_lists (cxl_owner); (T392839 and T390510) [18:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P75539 and previous config saved to /var/cache/conftool/dbconfig/20250428-181204-fceratto.json [18:12:46] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: T392804 [18:13:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:14:10] RESOLVED: [2x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:14:39] !log 
jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: T392804 [18:14:39] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns4004.wikimedia.org [18:14:40] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns4004.wikimedia.org [18:14:52] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns4004.wikimedia.org [reason: reboot finished] [18:15:13] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns4004.wikimedia.org [18:15:20] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2100 - bking@cumin2002" [18:15:26] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2100 - bking@cumin2002" [18:15:26] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:15:26] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2100.codfw.wmnet 219.32.192.10.in-addr.arpa 9.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [18:15:30] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2100.codfw.wmnet 219.32.192.10.in-addr.arpa 9.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [18:15:31] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2100 [18:15:43] !log sukhe@dns1004 START - running authdns-update [18:18:09] !log sukhe@dns1004 END - running authdns-update [18:18:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T392806)', diff saved to https://phabricator.wikimedia.org/P75540 and previous config saved to 
/var/cache/conftool/dbconfig/20250428-181811-ladsgroup.json [18:18:33] bking@cumin2002 reimage (PID 1179440) is awaiting input [18:23:26] (PS1) Ryan Kemper: wdqs: route query.wd.org to wdqs-main [puppet] - https://gerrit.wikimedia.org/r/1139531 (https://phabricator.wikimedia.org/T388134) [18:24:58] (PS4) Ryan Kemper: sre.wdqs.data-transfer: improve graph type checks [cookbooks] - https://gerrit.wikimedia.org/r/1097552 (https://phabricator.wikimedia.org/T376150) [18:27:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P75542 and previous config saved to /var/cache/conftool/dbconfig/20250428-182711-fceratto.json [18:28:55] (CR) Ryan Kemper: "That's a good point, I think the backend request will be a bit more straightforward. I'll try that approach first." [deployment-charts] - https://gerrit.wikimedia.org/r/1138935 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper) [18:29:09] (CR) Ladsgroup: "If Growth team is okay with it, I can deploy it."
[puppet] - https://gerrit.wikimedia.org/r/1139515 (https://phabricator.wikimedia.org/T392834) (owner: Gergő Tisza) [18:29:20] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139531 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper) [18:30:03] !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aphlict2001.codfw.wmnet with reason: Bookworm Reboot [18:30:13] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org [18:30:49] !log aokoth@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM aphlict2001.codfw.wmnet [18:31:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:32:06] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:34:02] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:34:02] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:34:10] ^ expected [18:34:37] !log aokoth@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aphlict2001.codfw.wmnet [18:34:56] (CR) BCornwall: [C:+1] P:durum: only log ECH status for ECH-enabled clients [puppet] - https://gerrit.wikimedia.org/r/1139525 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh) [18:35:26] (CR) Ssingh: [V:+1 C:+2] P:durum: only log ECH status for ECH-enabled clients [puppet] - https://gerrit.wikimedia.org/r/1139525 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh) [18:35:27] FIRING: [2x] SystemdUnitFailed:
prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:36:10] FIRING: BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:37:16] !log run agent on A:durum [18:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:35] (03PS1) 10AOkoth: aphlict: ensure on passive host [puppet] - 10https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) [18:40:02] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:40:02] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:42:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T392806)', diff saved to https://phabricator.wikimedia.org/P75543 and previous config saved to /var/cache/conftool/dbconfig/20250428-184217-fceratto.json [18:42:36] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [18:42:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T392806)', diff saved to https://phabricator.wikimedia.org/P75544 and previous config saved to /var/cache/conftool/dbconfig/20250428-184243-fceratto.json [18:43:24] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: T392804 [18:45:08] !log jhathaway@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out2001.wikimedia.org with reason: T392804 [18:45:28] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir1001.eqiad.wmnet [18:46:10] RESOLVED: BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:47:50] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org [18:48:50] 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844 (10RobH) 03NEW [18:49:11] !log brett@cumin2002 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on 14 hosts with reason: upgrades [18:49:11] 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10773978 (10RobH) [18:49:57] 10ops-codfw, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be2004 - https://phabricator.wikimedia.org/T392845 (10RobH) 03NEW [18:50:01] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir1001.eqiad.wmnet [18:50:03] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 14 hosts with reason: upgrades [18:50:18] 10ops-codfw, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be2004 - https://phabricator.wikimedia.org/T392845#10773997 (10RobH) [18:50:22] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir1002.eqiad.wmnet [18:50:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T392806)', diff saved to https://phabricator.wikimedia.org/P75545 and previous config saved to /var/cache/conftool/dbconfig/20250428-185043-fceratto.json [18:50:56] 
ops-eqiad, SRE-swift-storage, DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10773999 (RobH) a:MatthewVernon Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new se... [18:51:09] ops-codfw, SRE-swift-storage, DC-Ops: Q4:rack/setup/install apus-be2004 - https://phabricator.wikimedia.org/T392845#10774003 (RobH) a:MatthewVernon Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new se... [18:54:49] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir1002.eqiad.wmnet [18:55:21] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir2001.codfw.wmnet [18:57:05] (CR) Dzahn: "The message says this enables it on the passive host. But it's disabling it on the active host." [puppet] - https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) (owner: AOkoth) [18:57:33] (PS1) Ryan Kemper: query-legacy-full: set cluster in hiera [puppet] - https://gerrit.wikimedia.org/r/1139537 (https://phabricator.wikimedia.org/T384422) [18:58:31] (CR) Dzahn: "You can either just delete any setting here at hosts level.. it would still be present on both and should be no change at all in compiler."
[puppet] - https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) (owner: AOkoth) [18:58:32] SRE: Cannot connect to SQL server on mwmaint1002 - https://phabricator.wikimedia.org/T392846#10774042 (jrbs) [18:59:16] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir2001.codfw.wmnet [18:59:30] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139537 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper) [19:01:30] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir2002.codfw.wmnet [19:02:50] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns5004.wikimedia.org [19:02:55] (PS2) AOkoth: aphlict: ensure absent on active host [puppet] - https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) [19:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:05:06] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:05:10] SRE-tools, Spicerack, Traffic: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848 (ssingh) NEW [19:05:31] SRE-tools, Infrastructure-Foundations, Spicerack, Traffic: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10774076 (ssingh) p:Triage→Low [19:05:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff
saved to https://phabricator.wikimedia.org/P75546 and previous config saved to /var/cache/conftool/dbconfig/20250428-190550-fceratto.json [19:05:57] (03CR) 10Dzahn: "Yes, now it matches what it does. technically there is no need to add the aphlict2001.yaml at all.. since present is default. But if you a" [puppet] - 10https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth) [19:06:54] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir2002.codfw.wmnet [19:07:02] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:07:02] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:07:34] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir(3|4|5|6|7)001.* [19:07:35] (03CR) 10Dzahn: "nitpick: not needed to allow failover.. and we could also just leave the service running on both..DNS switch alone should do it. 
but it do" [puppet] - 10https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth) [19:08:10] FIRING: BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:08:30] (03PS3) 10Dzahn: aphlict: ensure absent on active host [puppet] - 10https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth) [19:09:11] (03PS4) 10Dzahn: aphlict: ensure absent on active host [puppet] - 10https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth) [19:09:41] (03CR) 10Dzahn: [C:03+1] "+1 but only AFTER DNS change" [puppet] - 10https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth) [19:13:02] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:13:04] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:16:25] FIRING: [5x] SystemdUnitFailed: confd_prometheus_metrics.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:16:28] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir3003.* [19:17:21] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns5004.wikimedia.org [19:18:10] RESOLVED: BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:18:10] (03CR) 10Bking: [C:03+1] "Matches plan outlined in ticket" [puppet] - 10https://gerrit.wikimedia.org/r/1139531 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [19:18:19] (03PS1) 10Ssingh: hiera: durum: set do_ech true for all durum hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139542 (https://phabricator.wikimedia.org/T205378) [19:19:18] (03PS2) 10Ssingh: hiera: durum: set do_ech true for all durum hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139542 (https://phabricator.wikimedia.org/T205378) [19:20:43] (03CR) 10BCornwall: [C:03+1] hiera: durum: set do_ech true for all durum hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139542 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [19:20:45] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1139542 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [19:20:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P75547 and previous config saved to /var/cache/conftool/dbconfig/20250428-192057-fceratto.json [19:21:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:22:45] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir3003.esams.wmnet [19:23:09] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir[4-7]001.* [19:24:37] !log brett@puppetserver1001 
conftool action : set/pooled=no; selector: name=ncredir[3-7]002.* [19:24:51] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir3004.* [19:27:22] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 84280MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [19:28:11] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir.* [19:28:52] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for 14 hosts [19:29:03] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 14 hosts [19:30:50] (03CR) 10Dzahn: [C:03+1] gitlab: use read-only object storage credentials on replicas [puppet] - 10https://gerrit.wikimedia.org/r/1139005 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [19:32:21] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org [19:34:15] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: upgrades [19:35:51] (03PS1) 10AOkoth: wmnet: change active aphlict host [dns] - 10https://gerrit.wikimedia.org/r/1139546 (https://phabricator.wikimedia.org/T392128) [19:35:54] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:36:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T392806)', diff saved to https://phabricator.wikimedia.org/P75548 and previous config saved to /var/cache/conftool/dbconfig/20250428-193605-fceratto.json [19:36:10] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:36:26] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [19:36:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T392806)', diff saved to https://phabricator.wikimedia.org/P75549 and previous config saved to /var/cache/conftool/dbconfig/20250428-193632-fceratto.json [19:38:10] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:39:13] (03PS1) 10BCornwall: acmechief: Switch active/passive instances [puppet] - 10https://gerrit.wikimedia.org/r/1139549 [19:40:10] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:40:54] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:41:10] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:41:31] (03CR) 10Ssingh: [C:03+1] acmechief: Switch active/passive instances [puppet] - 10https://gerrit.wikimedia.org/r/1139549 (owner: 10BCornwall) [19:42:37] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org [19:43:35] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5378/console" [puppet] - 10https://gerrit.wikimedia.org/r/1139549 (owner: 10BCornwall) [19:45:42] (03CR) 10BCornwall: [V:03+1 C:03+2] acmechief: Switch active/passive instances [puppet] - 10https://gerrit.wikimedia.org/r/1139549 (owner: 10BCornwall) [19:47:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db1192 (T392806)', diff saved to https://phabricator.wikimedia.org/P75550 and previous config saved to /var/cache/conftool/dbconfig/20250428-194708-fceratto.json [19:48:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [19:52:10] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:53:46] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:54:10] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:55:22] !log Upgrade/reboot acme-chief servers [19:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:37] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns6002.wikimedia.org [19:58:49] 10ops-codfw, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp40[53-68] - https://phabricator.wikimedia.org/T392851 (10RobH) 03NEW [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T2000). [20:00:05] danisztls, bwang, and bd808: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:31] o/ [20:00:45] 10ops-codfw, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp40[53-68] - https://phabricator.wikimedia.org/T392851#10774204 (10RobH) a:03ssingh @ssingh, We didn't get racking details on ordering task T389840, so can you populate the racking details on this racking task? Additionally, please update the site.... [20:01:11] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:01:22] o/ [20:01:30] 10ops-codfw, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp40[53-68] - https://phabricator.wikimedia.org/T392851#10774214 (10RobH) [20:02:14] o/ [20:02:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P75551 and previous config saved to /var/cache/conftool/dbconfig/20250428-200215-fceratto.json [20:03:42] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:03:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:04:03] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:04:59] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:05:20] danisztls and bwang: I can do the needful since it looks like the other deployers aren't here at the moment. [20:06:10] It doesn't look like any of our changes are easily testable on the staging servers. [20:07:11] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:07:13] bd808: yes, thanks [20:10:47] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet [20:10:51] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet [20:10:55] woah. what's this massive pile of "No space left on device" errors? 
[20:11:04] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2100 [20:11:04] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2100 [20:11:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2045.codfw.wmnet with OS bookworm [20:11:17] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774239 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm [20:11:25] FIRING: [8x] SystemdUnitFailed: confd_prometheus_metrics.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:11:51] bd808: mwmaint1002 ? [20:11:54] zabe: are you around? It looks like your job that is running migrateESRefToContentTable.php is having a really bad time. 
[20:12:10] dancy: yeah [20:12:14] That's T392834 [20:12:15] T392834: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834 [20:13:09] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns6002.wikimedia.org [20:13:16] 454,813 events for it in logspam-watch [20:13:23] oof [20:13:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:15:04] 06SRE, 13Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774253 (10bd808) There are 454,813 "PHP Notice: fwrite(): write of 63 bytes failed with errno=28 No space left on device" errors in `logspam-watch` right now. It looks like the `extensions/WikimediaMainten... [20:16:25] FIRING: [9x] SystemdUnitFailed: confd_prometheus_metrics.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:28] ok. the logspam looks unrelated to prod wikis, so lets get on with backports [20:16:55] do not worry about home dirs as long as we have this: [20:16:56] 24G mediawiki_job_growthexperiments-refreshLinkRecommendations-s2 [20:16:56] 25G mediawiki_job_growthexperiments-refreshLinkRecommendations-s3 [20:17:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P75552 and previous config saved to /var/cache/conftool/dbconfig/20250428-201723-fceratto.json [20:18:01] * bd808 is about to click SpiderPig's "Start Backport" button for his first time outside of local dev testing [20:18:09] Woohoo! 
[20:18:14] 🕸️ [20:18:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) (owner: 10BryanDavis) [20:18:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139474 (https://phabricator.wikimedia.org/T392325) (owner: 10DDesouza) [20:18:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138859 (https://phabricator.wikimedia.org/T388719) (owner: 10Bernard Wang) [20:18:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:18:59] oooh going for a triple [20:19:02] 06SRE, 13Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774270 (10Dzahn) It's almost entirely just logs from the growth experiments jobs. and under /var/log/ ` 24G mediawiki_job_growthexperiments-refreshLinkRecommendations-s2 25G mediawiki_job_growthexperimen... 
[20:19:19] yeah, they are all config only and none of them are really testable [20:20:01] (03Merged) 10jenkins-bot: dblists: Add sul.dbexpr and generated sul.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) (owner: 10BryanDavis) [20:20:04] (03Merged) 10jenkins-bot: Design Research Participant Survey: Increase Coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139474 (https://phabricator.wikimedia.org/T392325) (owner: 10DDesouza) [20:20:07] (03Merged) 10jenkins-bot: Remove Search AB test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138859 (https://phabricator.wikimedia.org/T388719) (owner: 10Bernard Wang) [20:20:22] !log bd808@deploy1003 Started scap sync-world: Backport for [[gerrit:1137087|dblists: Add sul.dbexpr and generated sul.dblist (T392142)]], [[gerrit:1139474|Design Research Participant Survey: Increase Coverage (T392325)]], [[gerrit:1138859|Remove Search AB test config (T388719)]] [20:20:29] T392142: Office Wiki credentials inexplicably stop working - https://phabricator.wikimedia.org/T392142 [20:20:29] T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325 [20:20:29] T388719: Clean up Search AB test code - https://phabricator.wikimedia.org/T388719 [20:21:25] FIRING: [9x] SystemdUnitFailed: confd_prometheus_metrics.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:23:35] (03CR) 10BCornwall: [C:03+1] wmnet: change active aphlict host [dns] - 10https://gerrit.wikimedia.org/r/1139546 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth) [20:24:15] jhancock@cumin2002 reimage (PID 1302317) is awaiting input [20:24:36] (03CR) 10Dzahn: [C:03+1] wmnet: change active aphlict host [dns] - 
10https://gerrit.wikimedia.org/r/1139546 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth) [20:25:02] !log bd808@deploy1003 dani, bwang, bd808: Backport for [[gerrit:1137087|dblists: Add sul.dbexpr and generated sul.dblist (T392142)]], [[gerrit:1139474|Design Research Participant Survey: Increase Coverage (T392325)]], [[gerrit:1138859|Remove Search AB test config (T388719)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:26:25] FIRING: [9x] SystemdUnitFailed: confd_prometheus_metrics.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:27:43] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2100.codfw.wmnet with reason: host reimage [20:28:09] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org [20:28:20] !log bd808@deploy1003 dani, bwang, bd808: Continuing with sync [20:30:05] !log mwmaint1002 - manually gzipped some syslog.1 file from growthexperiment jobs that used up all disk space - systemctl start logrotate T392834 [20:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:11] T392834: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834 [20:31:25] FIRING: [9x] SystemdUnitFailed: confd_prometheus_metrics.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:31:59] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:32:21] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:32:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T392806)', diff saved to https://phabricator.wikimedia.org/P75553 and previous config saved to /var/cache/conftool/dbconfig/20250428-203230-fceratto.json [20:32:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [20:32:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2100.codfw.wmnet with reason: host reimage [20:32:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T392806)', diff saved to https://phabricator.wikimedia.org/P75554 and previous config saved to /var/cache/conftool/dbconfig/20250428-203255-fceratto.json [20:34:22] RECOVERY - OpenSearch health check for shards on 9400 on cirrussearch2093 is OK: OK - elasticsearch status production-search-omega-codfw: cluster_name: production-search-omega-codfw, status: green, timed_out: False, number_of_nodes: 29, number_of_data_nodes: 29, discovered_master: True, active_primary_shards: 1704, active_shards: 5111, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number [20:34:22] ing_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:34:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2093-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [20:34:57] !log bd808@deploy1003 Finished scap 
sync-world: Backport for [[gerrit:1137087|dblists: Add sul.dbexpr and generated sul.dblist (T392142)]], [[gerrit:1139474|Design Research Participant Survey: Increase Coverage (T392325)]], [[gerrit:1138859|Remove Search AB test config (T388719)]] (duration: 14m 34s) [20:35:03] T392142: Office Wiki credentials inexplicably stop working - https://phabricator.wikimedia.org/T392142 [20:35:04] T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325 [20:35:04] T388719: Clean up Search AB test code - https://phabricator.wikimedia.org/T388719 [20:35:37] danisztls and bwang: Your changes are live on the project wikis [20:36:07] 06SRE, 13Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774417 (10Tgr) The mediawiki_job_growthexperiments-refreshLinkRecommendations-* logs should be fine to delete, if you are looking for some emergency space savings. It's the output of a job creating seconda... [20:36:25] FIRING: [9x] SystemdUnitFailed: export_smart_data_dump.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:37:20] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:37:55] 06SRE, 13Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774427 (10Tgr) Logrotate should probably enforce some default storage quota for jobs. 
[20:38:00] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:38:19] everything looks normal on the error log watching places other than the T392834 stuff that is unrelated to the backports [20:38:19] Ok thank you [20:38:19] T392834: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834 [20:38:41] * bd808 declares the backport window closed [20:40:45] dancy: a thought for SpiderPig -- what is the `scap backport --revert` story there? I think the answer is use ssh and scap on the cli, but maybe I'm missing something? [20:41:19] bd808: Make it easy to revert is on the list of improvements. [20:41:25] FIRING: [9x] SystemdUnitFailed: export_smart_data_dump.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:41:27] RESOLVED: SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on cirrussearch2093:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:42:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T392806)', diff saved to https://phabricator.wikimedia.org/P75555 and previous config saved to /var/cache/conftool/dbconfig/20250428-204219-fceratto.json [20:42:51] 06SRE, 13Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774453 (10Dzahn) Thanks for confirming that. I deleted the 2 largest syslog files, from mediawiki_job_growthexperiments-refreshLinkRecommendations-s2 and mediawiki_job_growthexperiments-refreshLinkRecommen... 
[20:43:44] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org [20:45:12] 06SRE, 13Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774460 (10Dzahn) Stopping the services `mediawiki_job_growthexperiments-refreshLinkRecommendations-s2` and `mediawiki_job_growthexperiments-refreshLinkRecommendations-s3` also does not properly shut them d... [20:45:37] tgr_: looks like we'd have to manually kill processes to stop that [20:45:54] /bin/sh -c /usr/local/bin/foreachwikiindblist 'growthexperiments & s2' .... does not go away [20:46:25] RESOLVED: [7x] SystemdUnitFailed: export_smart_data_dump.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:53:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2046.codfw.wmnet with OS bookworm [20:53:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm [20:53:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2047.codfw.wmnet with OS bookworm [20:53:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2048.codfw.wmnet with OS bookworm [20:53:46] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2046.codfw.wmnet with OS bookworm [20:53:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774520 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS 
bookworm [20:53:49] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774521 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm executed with err... [20:53:52] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774523 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm [20:54:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm [20:54:22] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774525 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm [20:56:54] ok, mwmaint1002 disk issue resolved for now [20:57:14] had to also restart rsyslogd which kept deleted huge logs open and stuff [20:57:23] usage on / back to 60% [20:57:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P75556 and previous config saved to /var/cache/conftool/dbconfig/20250428-205727-fceratto.json [20:57:59] 06SRE, 13Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774533 (10Dzahn) Killed the processes for growthexperiments-refreshLinkRecommendations-s2 and growthexperiments-refreshLinkRecommendations-s3. gzipped more syslog files. Still not a lot of space. rsysl... 
[20:58:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2100.codfw.wmnet with OS bullseye [20:58:43] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org [21:00:05] Reedy, sbassett, Maryum, and manfredi: gettimeofday() says it's time for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T2100) [21:02:48] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:03:00] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:03:02] 06SRE, 13Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774543 (10Dzahn) ` kill 24015 kill 24047 ` ` systemctl start logrotate .. systemctl start prometheus-dpkg-success-textfile.service .. start prometheus_intel_microcode.service .. systemctl start prometheus-... 
[21:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:04:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2046.codfw.wmnet with reason: host reimage [21:04:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2047.codfw.wmnet with reason: host reimage [21:05:52] RECOVERY - Disk space on mwmaint1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops [21:08:00] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:08:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2046.codfw.wmnet with reason: host reimage [21:12:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P75557 and previous config saved to /var/cache/conftool/dbconfig/20250428-211234-fceratto.json [21:14:21] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org [21:14:21] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and not (A:eqiad or A:codfw) and A:dnsbox [21:15:15] 06SRE, 06serviceops-radar, 13Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774578 (10Dzahn) [21:15:50] 06SRE: Cannot connect to SQL server on mwmaint1002 - https://phabricator.wikimedia.org/T392846#10774590 (10Dzahn) almost certainly caused by T392834 [21:16:30] 06SRE: Cannot 
connect to SQL server on mwmaint1002 - https://phabricator.wikimedia.org/T392846#10774594 (10Dzahn) after mwmaint1002 has some disk space again. now: ` [mwmaint1002:~] $ sql centralauth ... Welcome to the MariaDB monitor. Commands end with ; or \g. ` [21:17:07] 06SRE, 06serviceops-radar: Cannot connect to SQL server on mwmaint1002 - https://phabricator.wikimedia.org/T392846#10774595 (10Dzahn) [21:20:20] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10774597 (10ssingh) [21:20:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10774599 (10ssingh) a:05ssingh→03BCornwall [21:23:25] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:23:54] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns2006*} and A:dnsbox [21:23:54] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org [21:24:32] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:25:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10774617 (10ssingh) Thanks @RobH. Task assigned to Traffic and hostnames updated. We will take care of the preseed.yaml bit, thanks for the reminder! 
[21:26:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:26:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2046.codfw.wmnet with OS bookworm [21:26:22] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2046.codfw.wmnet with OS bookworm completed: - gane... [21:27:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T392806)', diff saved to https://phabricator.wikimedia.org/P75558 and previous config saved to /var/cache/conftool/dbconfig/20250428-212741-fceratto.json [21:28:00] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance [21:28:00] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:28:00] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:28:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T392806)', diff saved to https://phabricator.wikimedia.org/P75559 and previous config saved to /var/cache/conftool/dbconfig/20250428-212806-fceratto.json [21:28:40] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774626 (10Jhancock.wm) [21:28:40] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2045.codfw.wmnet with OS bookworm [21:28:46] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - 
https://phabricator.wikimedia.org/T384838#10774627 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm executed with err... [21:31:00] RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:31:00] RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:31:41] Hey all - we have two security patches going out for the window today. [21:34:57] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10774644 (10ssingh) (Scratch that, preseed.yaml is `cp[1-9][0-9][0-9][0-9]` so that's good but we just need to update site.pp) [21:35:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:36:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T392806)', diff saved to https://phabricator.wikimedia.org/P75560 and previous config saved to /var/cache/conftool/dbconfig/20250428-213601-fceratto.json [21:36:29] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org [21:36:29] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns2006*} and A:dnsbox [21:36:59] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10774654 (10BCornwall) [21:39:56] (03PS1) 10BCornwall: site.pp: Include new codfw cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139559 (https://phabricator.wikimedia.org/T392851) [21:42:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy 
FORCE_RESTART [21:42:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:46:56] !log Deployed security fix for T385792 [21:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:47:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:48:20] (03PS6) 10Dzahn: gerrit: have different motd banners on active/passive servers [puppet] - 10https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) [21:48:20] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1137840/5383/" [puppet] - 10https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: 10Dzahn) [21:48:42] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:51:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P75561 and previous config saved to /var/cache/conftool/dbconfig/20250428-215107-fceratto.json [21:51:30] (03PS7) 10Dzahn: gerrit: have different motd banners on active/passive 
servers [puppet] - 10https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) [21:56:03] jhancock@cumin2002 reimage (PID 1409724) is awaiting input [22:03:19] !log Deployed security fix for T392276 [22:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P75562 and previous config saved to /var/cache/conftool/dbconfig/20250428-220615-fceratto.json [22:09:43] 06SRE, 06serviceops-radar: Cannot connect to MariaDB server from mwmaint1002 - https://phabricator.wikimedia.org/T392846#10774830 (10Reedy) [22:13:42] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:19:40] !log Deployed security fix for T391343 [22:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T392806)', diff saved to https://phabricator.wikimedia.org/P75563 and previous config saved to /var/cache/conftool/dbconfig/20250428-222122-fceratto.json [22:21:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [22:21:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T392806)', diff saved to https://phabricator.wikimedia.org/P75564 and previous config saved to /var/cache/conftool/dbconfig/20250428-222148-fceratto.json [22:23:37] (03PS6) 10BryanDavis: Use `sul` dblist in InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 [22:29:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1211 (T392806)', diff saved to https://phabricator.wikimedia.org/P75566 and previous config saved to /var/cache/conftool/dbconfig/20250428-222946-fceratto.json [22:30:07] (03CR) 10Bking: [C:03+2] Update opensearch-madvise call for version 0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1132692 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson) [22:31:24] (03CR) 10Bking: [C:03+2] "I built and deployed the deb mentioned in Ebernhardson's comment, so we are good to move forward." [puppet] - 10https://gerrit.wikimedia.org/r/1132692 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson) [22:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:38:06] (03CR) 10BryanDavis: [C:04-1] "I need help thinking about https://phabricator.wikimedia.org/P75565 and how to handle the logic inversion I am doing in the Beta Cluster w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [22:38:47] FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:40:09] (03PS1) 10Bking: Revert "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139565 [22:40:26] (03CR) 10Bking: [V:03+2 C:03+2] Revert "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139565 (owner: 10Bking) [22:44:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P75567 and previous config saved to 
/var/cache/conftool/dbconfig/20250428-224453-fceratto.json [22:48:42] FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:53:42] FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:54:05] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10774902 (10VRiley-WMF) 05Open→03Resolved Dell was onsite today and replaced the motherboard, moved DIMMs around, replaced cables and replaced a CPU. Here's hoping we can finally close this ticket,... [23:00:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P75568 and previous config saved to /var/cache/conftool/dbconfig/20250428-230001-fceratto.json [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T2300) [23:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:04:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm [23:05:02] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774920 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm [23:11:24] (03CR) 10BryanDavis: [C:04-1] "Paying more attention, there are currently only 2 Beta Cluster wikis that end up with unexpected config:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [23:13:42] FIRING: [7x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:15:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T392806)', diff saved to https://phabricator.wikimedia.org/P75569 and previous config saved to /var/cache/conftool/dbconfig/20250428-231508-fceratto.json [23:15:27] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [23:15:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T392806)', diff saved to https://phabricator.wikimedia.org/P75570 and previous config saved to /var/cache/conftool/dbconfig/20250428-231534-fceratto.json [23:30:23] 06SRE, 06serviceops-radar, 13Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774930 (10bd808) p:05Unbreak!→03High Dropping priority to High as it seems @Dzahn's cleanup work has taken care of the immediate problem. I'll leave it to him and others to decide... [23:35:05] 06SRE, 06serviceops-radar, 13Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774935 (10Zabe) >>! In T392834#10773349, @elukey wrote: > ` > elukey@mwmaint1002:/home$ sudo du -hs /home/* | sort -h | tail > 553M /home/ebernhardson > 842M /home/catrope > 1.2G /hom... 
[23:39:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1139571 [23:39:45] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1139571 (owner: 10TrainBranchBot) [23:45:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T392806)', diff saved to https://phabricator.wikimedia.org/P75571 and previous config saved to /var/cache/conftool/dbconfig/20250428-234542-fceratto.json [23:47:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2047.codfw.wmnet with OS bookworm [23:47:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774945 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm executed with err... 
[23:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:50:42] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1139571 (owner: 10TrainBranchBot) [23:53:42] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:54:07] !log zabe@mwmaint1002:~$ mwscript extensions/AbuseFilter/maintenance/MigrateESRefToAflTable.php enwiki --deletedump /home/zabe/afl_text_table_deletedump/enwiki --dump /home/zabe/afl_text_table_dump/enwiki --sleep 0.5 # T381599 [23:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:11] T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599