[00:09:48] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1139210
[00:09:48] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1139210 (owner: TrainBranchBot)
[00:11:06] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 658.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:11:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:21:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:34:01] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1139210 (owner: TrainBranchBot)
[00:40:56] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:42:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:46:49] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:49:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: backup-kdc-database.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:54:17] <jinxer-wm>	 FIRING: JobQueueLowTrafficConsumerWidespreadHighLatency: ...
[00:54:18] <jinxer-wm>	 Processing delay times for low-traffic consumer rules are unusually high - https://wikitech.wikimedia.org/wiki/MediaWiki_JobQueue/Operations#JobQueueLowTrafficConsumerWidespreadHighLatency - https://grafana.wikimedia.org/d/fe130675-0c2d-4991-9dec-f54cf6a9c4d8/jobqueue-low-traffic-jobs?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DJobQueueLowTrafficConsumerWidespreadHighLatency
[00:55:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[01:34:17] <jinxer-wm>	 RESOLVED: JobQueueLowTrafficConsumerWidespreadHighLatency: ...
[01:34:18] <jinxer-wm>	 Processing delay times for low-traffic consumer rules are unusually high - https://wikitech.wikimedia.org/wiki/MediaWiki_JobQueue/Operations#JobQueueLowTrafficConsumerWidespreadHighLatency - https://grafana.wikimedia.org/d/fe130675-0c2d-4991-9dec-f54cf6a9c4d8/jobqueue-low-traffic-jobs?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DJobQueueLowTrafficConsumerWidespreadHighLatency
[01:40:56] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:42:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:43:42] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:50:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:09:08] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:43:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[02:57:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[03:28:32] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:29:22] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:48:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[03:53:42] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:53:46] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:46:49] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:49:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: backup-kdc-database.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:57] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: convert robots.txt to a flat file [puppet] - https://gerrit.wikimedia.org/r/1138330 (https://phabricator.wikimedia.org/T392669) (owner: Hashar)
[05:20:08] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2025-04-25-063512-production [deployment-charts] - https://gerrit.wikimedia.org/r/1139214 (https://phabricator.wikimedia.org/T392662)
[05:28:08] <wikibugs>	 (PS1) Marostegui: es1029: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139215 (https://phabricator.wikimedia.org/T391921)
[05:28:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1029 T391921', diff saved to https://phabricator.wikimedia.org/P75466 and previous config saved to /var/cache/conftool/dbconfig/20250428-052817-marostegui.json
[05:28:23] <stashbot>	 T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[05:28:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1032 and es2028 T391921', diff saved to https://phabricator.wikimedia.org/P75467 and previous config saved to /var/cache/conftool/dbconfig/20250428-052836-marostegui.json
[05:29:03] <wikibugs>	 (CR) Arnaudb: gerrit: switchover to gerrit1003 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[05:29:05] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1029.eqiad.wmnet with reason: Maintenance
[05:29:26] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: Maintenance
[05:30:29] <wikibugs>	 (CR) Marostegui: [C:+2] es1029: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139215 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[05:32:51] <wikibugs>	 (PS1) Marostegui: es2028: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139216 (https://phabricator.wikimedia.org/T391921)
[05:33:18] <wikibugs>	 (CR) Marostegui: [C:+2] es2028: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139216 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[05:35:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75468 and previous config saved to /var/cache/conftool/dbconfig/20250428-053532-root.json
[05:38:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75469 and previous config saved to /var/cache/conftool/dbconfig/20250428-053829-root.json
[05:42:07] <marostegui>	 !log Migrate es1029 and es2028 to MariaDB 10.11 T391921
[05:42:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:12] <stashbot>	 T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[05:43:42] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:47:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1030 and es2026 T391921', diff saved to https://phabricator.wikimedia.org/P75470 and previous config saved to /var/cache/conftool/dbconfig/20250428-054741-marostegui.json
[05:47:47] <stashbot>	 T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[05:48:28] <wikibugs>	 (PS1) Marostegui: es1030: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139218 (https://phabricator.wikimedia.org/T391921)
[05:48:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:48:56] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2026.codfw.wmnet,es1030.eqiad.wmnet with reason: Maintenance
[05:49:34] <wikibugs>	 (CR) Marostegui: [C:+2] es1030: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139218 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[05:50:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75471 and previous config saved to /var/cache/conftool/dbconfig/20250428-055038-root.json
[05:53:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75472 and previous config saved to /var/cache/conftool/dbconfig/20250428-055335-root.json
[05:54:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75473 and previous config saved to /var/cache/conftool/dbconfig/20250428-055411-root.json
[06:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:05:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75474 and previous config saved to /var/cache/conftool/dbconfig/20250428-060543-root.json
[06:08:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75476 and previous config saved to /var/cache/conftool/dbconfig/20250428-060840-root.json
[06:09:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75477 and previous config saved to /var/cache/conftool/dbconfig/20250428-060916-root.json
[06:16:03] <kart_>	 Deploying cxserver. 
[06:16:33] <wikibugs>	 (CR) KartikMistry: [C:+2] Update cxserver to 2025-04-25-063512-production [deployment-charts] - https://gerrit.wikimedia.org/r/1139214 (https://phabricator.wikimedia.org/T392662) (owner: KartikMistry)
[06:18:17] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2025-04-25-063512-production [deployment-charts] - https://gerrit.wikimedia.org/r/1139214 (https://phabricator.wikimedia.org/T392662) (owner: KartikMistry)
[06:20:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75480 and previous config saved to /var/cache/conftool/dbconfig/20250428-062049-root.json
[06:22:24] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:22:48] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:23:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75481 and previous config saved to /var/cache/conftool/dbconfig/20250428-062346-root.json
[06:24:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75482 and previous config saved to /var/cache/conftool/dbconfig/20250428-062422-root.json
[06:25:19] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:25:52] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:26:51] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:27:26] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:29:11] <kart_>	 !log Updated cxserver to 2025-04-25-063512-production (T392662)
[06:29:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:16] <stashbot>	 T392662: /v2/suggest/sections/{title}/{from}/{to}: Error in your SQL syntax; check for the right syntax to use near ') - https://phabricator.wikimedia.org/T392662
[06:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75484 and previous config saved to /var/cache/conftool/dbconfig/20250428-063555-root.json
[06:38:11] <wikibugs>	 (CR) Jelto: [C:+2] gitlab: disable ci_secure_files object storage [puppet] - https://gerrit.wikimedia.org/r/1139007 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[06:38:16] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1139120 (https://phabricator.wikimedia.org/T392628) (owner: JHathaway)
[06:38:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75485 and previous config saved to /var/cache/conftool/dbconfig/20250428-063851-root.json
[06:39:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75486 and previous config saved to /var/cache/conftool/dbconfig/20250428-063927-root.json
[06:39:44] <wikibugs>	 (PS1) Marostegui: es2026: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139309 (https://phabricator.wikimedia.org/T391921)
[06:40:24] <wikibugs>	 (CR) Marostegui: [C:+2] es2026: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139309 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[06:43:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[06:43:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75487 and previous config saved to /var/cache/conftool/dbconfig/20250428-064356-root.json
[06:50:30] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add trixie to pbuilder setup [puppet] - https://gerrit.wikimedia.org/r/1139037 (https://phabricator.wikimedia.org/T391083) (owner: Muehlenhoff)
[06:51:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75489 and previous config saved to /var/cache/conftool/dbconfig/20250428-065100-root.json
[06:52:37] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1139016 (owner: Majavah)
[06:53:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75490 and previous config saved to /var/cache/conftool/dbconfig/20250428-065357-root.json
[06:54:19] <wikibugs>	 (PS1) Filippo Giunchedi: statistics: add statsd_exporter [puppet] - https://gerrit.wikimedia.org/r/1139310 (https://phabricator.wikimedia.org/T392599)
[06:54:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75491 and previous config saved to /var/cache/conftool/dbconfig/20250428-065433-root.json
[06:55:53] <wikibugs>	 (CR) Elukey: [C:+2] profile::prometheus::k8s: drop istio gateway labels for ML [puppet] - https://gerrit.wikimedia.org/r/1138313 (https://phabricator.wikimedia.org/T387350) (owner: Elukey)
[06:56:25] <wikibugs>	 (PS2) Elukey: profile::pyrra: avoid Istio recording rules for SLOs [puppet] - https://gerrit.wikimedia.org/r/1138674 (https://phabricator.wikimedia.org/T387350)
[06:57:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:59:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75493 and previous config saved to /var/cache/conftool/dbconfig/20250428-065901-root.json
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T0700).
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:06:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75494 and previous config saved to /var/cache/conftool/dbconfig/20250428-070606-root.json
[07:09:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75495 and previous config saved to /var/cache/conftool/dbconfig/20250428-070902-root.json
[07:09:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75496 and previous config saved to /var/cache/conftool/dbconfig/20250428-070939-root.json
[07:12:28] <wikibugs>	 SRE, Infrastructure-Foundations: Use a forward port of Puppet 7 on Trixie hosts - https://phabricator.wikimedia.org/T392790 (MoritzMuehlenhoff) NEW
[07:14:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75497 and previous config saved to /var/cache/conftool/dbconfig/20250428-071408-root.json
[07:18:51] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] role: remove logstash role files [puppet] - https://gerrit.wikimedia.org/r/1138756 (owner: Filippo Giunchedi)
[07:19:51] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: Expand GitLab blocklist for new WMCS IP space [puppet] - https://gerrit.wikimedia.org/r/1139016 (owner: Majavah)
[07:21:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75498 and previous config saved to /var/cache/conftool/dbconfig/20250428-072111-root.json
[07:24:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75499 and previous config saved to /var/cache/conftool/dbconfig/20250428-072408-root.json
[07:24:18] <wikibugs>	 (CR) Filippo Giunchedi: "You are right, I was too hasty on this and jumped the gun on this. I'll abandon the review for now and we can keep iterating on T391687 wh" [puppet] - https://gerrit.wikimedia.org/r/1138754 (https://phabricator.wikimedia.org/T391687) (owner: Filippo Giunchedi)
[07:24:26] <wikibugs>	 (Abandoned) Filippo Giunchedi: logstash: bump shards for logstash-k8s [puppet] - https://gerrit.wikimedia.org/r/1138754 (https://phabricator.wikimedia.org/T391687) (owner: Filippo Giunchedi)
[07:24:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75500 and previous config saved to /var/cache/conftool/dbconfig/20250428-072444-root.json
[07:25:48] <wikibugs>	 (PS1) Muehlenhoff: Add component/puppet7 for trixie-wikimedia [puppet] - https://gerrit.wikimedia.org/r/1139314 (https://phabricator.wikimedia.org/T392790)
[07:29:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75501 and previous config saved to /var/cache/conftool/dbconfig/20250428-072914-root.json
[07:29:17] <godog>	 !log upgrade thanos to 0.38 on titan1* - T383966
[07:29:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:21] <stashbot>	 T383966: Upgrade Thanos to 0.38.0 - https://phabricator.wikimedia.org/T383966
[07:33:29] <wikibugs>	 (PS1) Marostegui: instance.schema: Add x3 [puppet] - https://gerrit.wikimedia.org/r/1139350 (https://phabricator.wikimedia.org/T390530)
[07:34:47] <godog>	 !log upgrade thanos to 0.38 on titan2* - T383966
[07:34:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:51] <stashbot>	 T383966: Upgrade Thanos to 0.38.0 - https://phabricator.wikimedia.org/T383966
[07:36:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75502 and previous config saved to /var/cache/conftool/dbconfig/20250428-073617-root.json
[07:37:51] <wikibugs>	 (PS1) Majavah: hieradata: Update Gerrit IPs in dmz_cidr [puppet] - https://gerrit.wikimedia.org/r/1139403 (https://phabricator.wikimedia.org/T392793)
[07:39:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75503 and previous config saved to /var/cache/conftool/dbconfig/20250428-073914-root.json
[07:39:40] <wikibugs>	 (PS1) Majavah: cr-cloud: Update Gerrit addressing [homer/public] - https://gerrit.wikimedia.org/r/1139406 (https://phabricator.wikimedia.org/T392793)
[07:39:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75504 and previous config saved to /var/cache/conftool/dbconfig/20250428-073950-root.json
[07:43:46] <wikibugs>	 (PS1) Elukey: sre.hosts.move-vlan: improve grep reports when renaming [cookbooks] - https://gerrit.wikimedia.org/r/1139407 (https://phabricator.wikimedia.org/T392729)
[07:44:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75506 and previous config saved to /var/cache/conftool/dbconfig/20250428-074419-root.json
[07:46:34] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:48:24] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:48:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:53:42] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:53:46] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:53:51] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:54:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75508 and previous config saved to /var/cache/conftool/dbconfig/20250428-075455-root.json
[07:56:27] <wikibugs>	 (CR) Elukey: [C:+1] Add component/puppet7 for trixie-wikimedia [puppet] - https://gerrit.wikimedia.org/r/1139314 (https://phabricator.wikimedia.org/T392790) (owner: Muehlenhoff)
[07:59:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75509 and previous config saved to /var/cache/conftool/dbconfig/20250428-075924-root.json
[07:59:47] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10771657 (MoritzMuehlenhoff)
[08:01:41] <wikibugs>	 (PS1) Ayounsi: gNMIc: collect transceivers states [puppet] - https://gerrit.wikimedia.org/r/1139410 (https://phabricator.wikimedia.org/T388641)
[08:01:43] <wikibugs>	 (PS1) Ayounsi: Fastnetmon: permanently disable graphite [puppet] - https://gerrit.wikimedia.org/r/1139411 (https://phabricator.wikimedia.org/T228380)
[08:02:05] <wikibugs>	 (PS2) Ayounsi: Fastnetmon: permanently disable graphite [puppet] - https://gerrit.wikimedia.org/r/1139411 (https://phabricator.wikimedia.org/T228380)
[08:02:18] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139411 (https://phabricator.wikimedia.org/T228380) (owner: Ayounsi)
[08:06:08] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:06:58] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:08:03] <wikibugs>	 (CR) David Caro: [C:+1] "LGTM, the old ip points to:" [puppet] - https://gerrit.wikimedia.org/r/1139403 (https://phabricator.wikimedia.org/T392793) (owner: Majavah)
[08:08:59] <wikibugs>	 (CR) David Caro: [C:+1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/1139406 (https://phabricator.wikimedia.org/T392793) (owner: Majavah)
[08:11:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:12:02] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add component/puppet7 for trixie-wikimedia [puppet] - https://gerrit.wikimedia.org/r/1139314 (https://phabricator.wikimedia.org/T392790) (owner: Muehlenhoff)
[08:12:05] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] profile::pyrra: avoid Istio recording rules for SLOs [puppet] - https://gerrit.wikimedia.org/r/1138674 (https://phabricator.wikimedia.org/T387350) (owner: Elukey)
[08:12:26] <wikibugs>	 (CR) Elukey: [C:+2] profile::pyrra: avoid Istio recording rules for SLOs [puppet] - https://gerrit.wikimedia.org/r/1138674 (https://phabricator.wikimedia.org/T387350) (owner: Elukey)
[08:14:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75510 and previous config saved to /var/cache/conftool/dbconfig/20250428-081430-root.json
[08:14:59] <moritzm>	 !log installing Linux 6.1.135 on Bookworm hosts
[08:15:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:20:24] <taavi>	 jouncebot: nowandnext
[08:20:24] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 39 minute(s)
[08:20:24] <jouncebot>	 In 1 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1000)
[08:21:08] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by taavi@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137732 (https://phabricator.wikimedia.org/T386689) (owner: Majavah)
[08:22:46] <wikibugs>	 (Merged) jenkins-bot: Add WMCS ranges to wgAutoblockExemptions [mediawiki-config] - https://gerrit.wikimedia.org/r/1137732 (https://phabricator.wikimedia.org/T386689) (owner: Majavah)
[08:23:33] <logmsgbot>	 !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1137732|Add WMCS ranges to wgAutoblockExemptions (T386689)]]
[08:23:37] <stashbot>	 T386689: Add new WMCS IPv6 ranges to MediaWiki configuration where required - https://phabricator.wikimedia.org/T386689
[08:23:53] <wikibugs>	 ops-magru, DC-Ops, Observability-Metrics, Patch-For-Review, SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10771693 (tappof) Hey @wiki_willy, thanks for the feedback! I'll take a look at your request and let you know.
[08:29:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75511 and previous config saved to /var/cache/conftool/dbconfig/20250428-082935-root.json
[08:30:11] <wikibugs>	 (CR) Vgutierrez: [C:+2] puppetserver: Deploy wmfuniq-keygen package [puppet] - https://gerrit.wikimedia.org/r/1138845 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[08:37:10] <wikibugs>	 (CR) Ayounsi: [C:+1] cr-cloud: Update Gerrit addressing [homer/public] - https://gerrit.wikimedia.org/r/1139406 (https://phabricator.wikimedia.org/T392793) (owner: Majavah)
[08:37:18] <wikibugs>	 (CR) Ayounsi: [C:+1] "thx!" [homer/public] - https://gerrit.wikimedia.org/r/1139406 (https://phabricator.wikimedia.org/T392793) (owner: Majavah)
[08:38:37] <logmsgbot>	 !log taavi@deploy1003 taavi: Backport for [[gerrit:1137732|Add WMCS ranges to wgAutoblockExemptions (T386689)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:38:43] <stashbot>	 T386689: Add new WMCS IPv6 ranges to MediaWiki configuration where required - https://phabricator.wikimedia.org/T386689
[08:39:52] <logmsgbot>	 !log taavi@deploy1003 taavi: Continuing with sync
[08:43:40] <wikibugs>	 (CR) Btullis: [C:+1] "Cool." [deployment-charts] - https://gerrit.wikimedia.org/r/1138827 (https://phabricator.wikimedia.org/T391348) (owner: Brouberol)
[08:44:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75513 and previous config saved to /var/cache/conftool/dbconfig/20250428-084440-root.json
[08:46:49] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:47:31] <moritzm>	 !log installing avahi security updates
[08:47:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:31] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on krb1002.eqiad.wmnet with reason: work in progress, not yet active
[08:49:20] <logmsgbot>	 !log taavi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137732|Add WMCS ranges to wgAutoblockExemptions (T386689)]] (duration: 25m 46s)
[08:49:24] <stashbot>	 T386689: Add new WMCS IPv6 ranges to MediaWiki configuration where required - https://phabricator.wikimedia.org/T386689
[08:49:51] <wikibugs>	 (CR) Majavah: [C:+2] cr-cloud: Update Gerrit addressing [homer/public] - https://gerrit.wikimedia.org/r/1139406 (https://phabricator.wikimedia.org/T392793) (owner: Majavah)
[08:51:16] <wikibugs>	 (Merged) jenkins-bot: cr-cloud: Update Gerrit addressing [homer/public] - https://gerrit.wikimedia.org/r/1139406 (https://phabricator.wikimedia.org/T392793) (owner: Majavah)
[08:54:30] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796 (MatthewVernon) NEW
[08:55:11] <taavi>	 !log update cr-cloud firewall policy for new gerrit ip address T392793
[08:55:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:15] <wikibugs>	 (CR) Hnowlan: [C:+2] mw:maintenance: migrate mediamoderation-updateMetrics to k8s [puppet] - https://gerrit.wikimedia.org/r/1139080 (https://phabricator.wikimedia.org/T385799) (owner: Hnowlan)
[08:55:27] <hnowlan>	 jouncebot: nowandnext
[08:55:27] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 4 minute(s)
[08:55:27] <jouncebot>	 In 1 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1000)
[08:57:09] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: Update Gerrit IPs in dmz_cidr [puppet] - https://gerrit.wikimedia.org/r/1139403 (https://phabricator.wikimedia.org/T392793) (owner: Majavah)
[08:58:36] <dcausse>	 !log restarting blazegraph on wdqs1013 (deadlocked)
[08:58:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:34] <wikibugs>	 (CR) FNegri: [C:+1] P:wmcs: maintain_dbusers: Use cloud-private for ToolsDB [puppet] - https://gerrit.wikimedia.org/r/1139019 (https://phabricator.wikimedia.org/T381272) (owner: Majavah)
[09:00:40] <wikibugs>	 (CR) FNegri: [C:+1] P:toolforge: disable_tool: Use ToolsDB internal IP instead [puppet] - https://gerrit.wikimedia.org/r/1139018 (https://phabricator.wikimedia.org/T381272) (owner: Majavah)
[09:01:07] <wikibugs>	 (PS1) Filippo Giunchedi: thanos: raise rule log level to avoid log spam [puppet] - https://gerrit.wikimedia.org/r/1139414 (https://phabricator.wikimedia.org/T383966)
[09:01:09] <wikibugs>	 (CR) Majavah: [C:+2] P:toolforge: disable_tool: Use ToolsDB internal IP instead [puppet] - https://gerrit.wikimedia.org/r/1139018 (https://phabricator.wikimedia.org/T381272) (owner: Majavah)
[09:02:28] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:02:30] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[09:02:35] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[09:02:48] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5368/co" [puppet] - https://gerrit.wikimedia.org/r/1139414 (https://phabricator.wikimedia.org/T383966) (owner: Filippo Giunchedi)
[09:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[09:04:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:08:21] <wikibugs>	 (PS1) Hnowlan: mw::maintenance: migrate mediamoderation-hourlyScan to k8s [puppet] - https://gerrit.wikimedia.org/r/1139415 (https://phabricator.wikimedia.org/T385799)
[09:08:45] <wikibugs>	 (CR) CI reject: [V:-1] mw::maintenance: migrate mediamoderation-hourlyScan to k8s [puppet] - https://gerrit.wikimedia.org/r/1139415 (https://phabricator.wikimedia.org/T385799) (owner: Hnowlan)
[09:09:59] <wikibugs>	 (PS2) Hnowlan: mw::maintenance: migrate mediamoderation-hourlyScan to k8s [puppet] - https://gerrit.wikimedia.org/r/1139415 (https://phabricator.wikimedia.org/T385799)
[09:10:08] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "LGTM, nice!" [puppet] - https://gerrit.wikimedia.org/r/1139411 (https://phabricator.wikimedia.org/T228380) (owner: Ayounsi)
[09:10:19] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10771815 (MatthewVernon) p:Triage→High
[09:10:57] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] envoyproxy: tweak default histogram buckets [puppet] - https://gerrit.wikimedia.org/r/1138329 (https://phabricator.wikimedia.org/T391333) (owner: Filippo Giunchedi)
[09:15:10] <wikibugs>	 (PS1) Majavah: P:toolforge: Apply admin-root sudo policy to all instances [puppet] - https://gerrit.wikimedia.org/r/1139416 (https://phabricator.wikimedia.org/T392797)
[09:17:17] <wikibugs>	 (CR) Hnowlan: [C:+2] mediawiki: migrate unsubscribeinactiveusers-metawiki to k8s [puppet] - https://gerrit.wikimedia.org/r/1138810 (https://phabricator.wikimedia.org/T388539) (owner: Hnowlan)
[09:20:44] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:20:44] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:21:44] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:21:44] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:22:15] <wikibugs>	 (PS1) Samtar: errorpage.html.erb: Use flex for page layout [puppet] - https://gerrit.wikimedia.org/r/1139049 (https://phabricator.wikimedia.org/T392692)
[09:25:29] <wikibugs>	 SRE, [Archived]Wikidata Dev Team, Prod-Kubernetes, Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10771849 (Silvan_WMDE) @Kirilloparma We have created and merged a patch that will ho...
[09:27:07] <wikibugs>	 SRE, Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10771861 (elukey) @herron now the citoid definition uses "raw" istio metrics, and from https://thanos.wikimedia.org/rules it seems that we are ranging at aro...
[09:28:02] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[09:28:08] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[09:30:02] <wikibugs>	 (Abandoned) Hnowlan: Revert "debug: reorder debug backends for eqiad switchover" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129297 (owner: Hnowlan)
[09:34:12] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] "Looks good to me insofar as I understand it (very little ^^). Do we need to configure Prometheus to pull from this new exporter or will th" [puppet] - https://gerrit.wikimedia.org/r/1139310 (https://phabricator.wikimedia.org/T392599) (owner: Filippo Giunchedi)
[09:34:55] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1 C:+2] thanos: raise rule log level to avoid log spam [puppet] - https://gerrit.wikimedia.org/r/1139414 (https://phabricator.wikimedia.org/T383966) (owner: Filippo Giunchedi)
[09:37:28] <wikibugs>	 (CR) Hasan Akgün (WMDE): "Same here, imo it's not a blocker for this patch to process but still something we should consider" [puppet] - https://gerrit.wikimedia.org/r/1139310 (https://phabricator.wikimedia.org/T392599) (owner: Filippo Giunchedi)
[09:38:42] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:38:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[09:40:20] <wikibugs>	 (Abandoned) Hnowlan: trafficserver: remove restbase from citoid request path everywhere [puppet] - https://gerrit.wikimedia.org/r/1124418 (https://phabricator.wikimedia.org/T361576) (owner: Hnowlan)
[09:41:51] <wikibugs>	 (CR) Dreamy Jazz: [C:+1] "Looks good from a TSP point of view." [puppet] - https://gerrit.wikimedia.org/r/1139415 (https://phabricator.wikimedia.org/T385799) (owner: Hnowlan)
[09:42:32] <wikibugs>	 (PS2) Hnowlan: trafficserver: route all but zhwiki PCS routes via the rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1138692 (https://phabricator.wikimedia.org/T390724)
[09:43:42] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:43:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[09:46:24] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis nupwiki in section s5
[09:47:45] <wikibugs>	 (Abandoned) Hnowlan: mediawiki: miscellaneous bits of jobrunner cleanup [puppet] - https://gerrit.wikimedia.org/r/1117525 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[09:48:30] <wikibugs>	 (CR) Majavah: [C:+2] P:wmcs: maintain_dbusers: Use cloud-private for ToolsDB [puppet] - https://gerrit.wikimedia.org/r/1139019 (https://phabricator.wikimedia.org/T381272) (owner: Majavah)
[09:48:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[09:50:34] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops: eno1 on gitlab-runner1003:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T392585#10771922 (LSobanski)
[09:51:10] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Checking sanitization for wikis nupwiki in section s5
[09:51:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:53:34] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2002.codfw.wmnet
[09:53:53] <wikibugs>	 SRE, Kubernetes: Increase vcpus on K8s control plane VMs - https://phabricator.wikimedia.org/T392289#10771944 (ops-monitoring-bot) VM ml-staging-ctrl2002.codfw.wmnet rebooted by elukey@cumin1002 with reason: Increase vcores and memory
[09:54:03] <elukey>	 !log increase vcores and memory available for ml-staging-ctrl2* - T392289#10771944
[09:54:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:07] <stashbot>	 T392289: Increase vcpus on K8s control plane VMs - https://phabricator.wikimedia.org/T392289
[09:55:12] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis nupwiki in section s5
[09:56:23] <wikibugs>	 (PS1) Hnowlan: mw::maintenance: migrate purge_parsercache_pc1 to k8s [puppet] - https://gerrit.wikimedia.org/r/1139422 (https://phabricator.wikimedia.org/T385800)
[09:56:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:56:46] <wikibugs>	 (CR) CI reject: [V:-1] mw::maintenance: migrate purge_parsercache_pc1 to k8s [puppet] - https://gerrit.wikimedia.org/r/1139422 (https://phabricator.wikimedia.org/T385800) (owner: Hnowlan)
[09:57:37] <wikibugs>	 (PS2) Hnowlan: mw::maintenance: migrate purge_parsercache_pc1 to k8s [puppet] - https://gerrit.wikimedia.org/r/1139422 (https://phabricator.wikimedia.org/T385800)
[09:58:31] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2002.codfw.wmnet
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1000)
[10:01:59] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2001.codfw.wmnet
[10:02:21] <wikibugs>	 SRE, Kubernetes: Increase vcpus on K8s control plane VMs - https://phabricator.wikimedia.org/T392289#10771964 (ops-monitoring-bot) VM ml-staging-ctrl2001.codfw.wmnet rebooted by elukey@cumin1002 with reason: Increase vcores and memory
[10:03:41] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis nupwiki in section s5
[10:04:40] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis nupwiki in section s5
[10:04:51] <wikibugs>	 (PS1) Hnowlan: mw::maintenance: move remaining pagetriage jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139424 (https://phabricator.wikimedia.org/T388536)
[10:06:55] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2001.codfw.wmnet
[10:09:13] <logmsgbot>	 fceratto@cumin1002 sanitize-wiki (PID 3414639) is awaiting input
[10:10:48] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis nupwiki in section s5
[10:18:46] <wikibugs>	 (PS3) Hashar: gerrit: prevent crawling of some URLs [puppet] - https://gerrit.wikimedia.org/r/1138331 (https://phabricator.wikimedia.org/T392669)
[10:19:52] <wikibugs>	 (CR) Hashar: "Rebased due to the parent change ( Ib2302cc1ff7b49f58bac0eab8eea7c1fe68e62ea" [puppet] - https://gerrit.wikimedia.org/r/1138331 (https://phabricator.wikimedia.org/T392669) (owner: Hashar)
[10:19:53] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Migrate MediaWiki.wikibase.* stats [extensions/Wikibase] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139426 (https://phabricator.wikimedia.org/T359251)
[10:20:16] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Migrate MediaWiki.$prefix.wikibase.client.scribunto.* to statslib [extensions/Wikibase] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139427 (https://phabricator.wikimedia.org/T359253)
[10:20:42] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Wikibase] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139426 (https://phabricator.wikimedia.org/T359251) (owner: Lucas Werkmeister (WMDE))
[10:20:50] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Wikibase] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139427 (https://phabricator.wikimedia.org/T359253) (owner: Lucas Werkmeister (WMDE))
[10:23:35] <wikibugs>	 (PS1) Hnowlan: mw::maintenance::campaignevents: migrate remaining updateutcts jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139428 (https://phabricator.wikimedia.org/T385867)
[10:24:52] <wikibugs>	 (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139422 (https://phabricator.wikimedia.org/T385800) (owner: Hnowlan)
[10:24:57] <wikibugs>	 (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139424 (https://phabricator.wikimedia.org/T388536) (owner: Hnowlan)
[10:25:12] <wikibugs>	 (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139428 (https://phabricator.wikimedia.org/T385867) (owner: Hnowlan)
[10:25:13] <wikibugs>	 (PS1) Muehlenhoff: kernel_report: Fix typo [puppet] - https://gerrit.wikimedia.org/r/1139429
[10:26:30] <wikibugs>	 (CR) Hashar: "I had the search console manually enabled to get access to the Google crawling dashboard and then attempt to fine tune what it is crawling" [dns] - https://gerrit.wikimedia.org/r/1138996 (https://phabricator.wikimedia.org/T392669) (owner: Hashar)
[10:32:27] <TheresNoTime>	 jouncebot: nowandnext
[10:32:27] <jouncebot>	 For the next 0 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1000)
[10:32:27] <jouncebot>	 In 2 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1300)
[10:32:37] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis nupwiki in section s5
[10:34:46] <wikibugs>	 (PS5) Samtar: InitialiseSettings: wgTemplateDataEnableDiscovery on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1134771 (https://phabricator.wikimedia.org/T377975)
[10:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:35:48] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): statistics::wmde: Configure statsd_exporter [puppet] - https://gerrit.wikimedia.org/r/1139431 (https://phabricator.wikimedia.org/T389344)
[10:36:16] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "I hope this is the right host and port…" [puppet] - https://gerrit.wikimedia.org/r/1139431 (https://phabricator.wikimedia.org/T389344) (owner: Lucas Werkmeister (WMDE))
[10:40:34] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis nupwiki in section s5
[10:41:04] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1134771 (https://phabricator.wikimedia.org/T377975) (owner: Samtar)
[10:42:39] <wikibugs>	 (Merged) jenkins-bot: InitialiseSettings: wgTemplateDataEnableDiscovery on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1134771 (https://phabricator.wikimedia.org/T377975) (owner: Samtar)
[10:42:53] <logmsgbot>	 !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1134771|InitialiseSettings: wgTemplateDataEnableDiscovery on testwiki (T377975)]]
[10:42:58] <stashbot>	 T377975: Enable template favouriting on Beta, pilot wikis, and test - https://phabricator.wikimedia.org/T377975
[10:43:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[10:44:13] <wikibugs>	 (PS1) Hnowlan: mw:maintenance:updatequerypages: move all ancientpages jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139432 (https://phabricator.wikimedia.org/T388534)
[10:44:40] <wikibugs>	 (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139432 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[10:46:33] <wikibugs>	 (PS2) Hnowlan: mw:maintenance:updatequerypages: move all deadendpages jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139432 (https://phabricator.wikimedia.org/T388534)
[10:47:30] <wikibugs>	 (PS1) Awight: Revert "Temporarily revoke ssh key for travel" [puppet] - https://gerrit.wikimedia.org/r/1139434
[10:47:40] <logmsgbot>	 !log samtar@deploy1003 samtar: Backport for [[gerrit:1134771|InitialiseSettings: wgTemplateDataEnableDiscovery on testwiki (T377975)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:48:31] <logmsgbot>	 !log samtar@deploy1003 samtar: Continuing with sync
[10:48:49] <wikibugs>	 (CR) Elukey: [C:+1] kernel_report: Fix typo [puppet] - https://gerrit.wikimedia.org/r/1139429 (owner: Muehlenhoff)
[10:53:29] <wikibugs>	 (PS1) Ozge: feat: adds articlequality_v2 [deployment-charts] - https://gerrit.wikimedia.org/r/1139436
[10:55:02] <wikibugs>	 (CR) Ozge: [V:+2 C:+2] feat: adds articlequality_v2 [deployment-charts] - https://gerrit.wikimedia.org/r/1139436 (owner: Ozge)
[10:55:12] <logmsgbot>	 !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134771|InitialiseSettings: wgTemplateDataEnableDiscovery on testwiki (T377975)]] (duration: 12m 18s)
[10:55:16] <stashbot>	 T377975: Enable template favouriting on Beta, pilot wikis, and test - https://phabricator.wikimedia.org/T377975
[10:56:50] <wikibugs>	 (Merged) jenkins-bot: feat: adds articlequality_v2 [deployment-charts] - https://gerrit.wikimedia.org/r/1139436 (owner: Ozge)
[10:58:47] <wikibugs>	 (PS1) Hnowlan: mw::maintenance: migrate a single updatequerypages_ancientpages shard [puppet] - https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534)
[10:58:49] <wikibugs>	 (PS1) Hnowlan: mw:maintenance: migrate all updatequerypages_ancientpages jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139438 (https://phabricator.wikimedia.org/T388534)
[10:59:14] <wikibugs>	 (CR) CI reject: [V:-1] mw::maintenance: migrate a single updatequerypages_ancientpages shard [puppet] - https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[11:00:15] <wikibugs>	 (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139432 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[11:00:26] <wikibugs>	 (CR) Hnowlan: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[11:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[11:06:33] <wikibugs>	 (PS2) Hnowlan: mw::maintenance: migrate a single updatequerypages_ancientpages shard [puppet] - https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534)
[11:08:46] <wikibugs>	 SRE, [Archived]Wikidata Dev Team, Prod-Kubernetes, Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10772066 (Silvan_WMDE) Until then: as noted above, the problem is not actually cause...
[11:12:23] <wikibugs>	 (PS3) Hnowlan: mw::maintenance: migrate a single updatequerypages_ancientpages shard [puppet] - https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534)
[11:13:07] <wikibugs>	 (PS1) Ladsgroup: EventStore: Add caching for per-page event lookups [extensions/CampaignEvents] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139439 (https://phabricator.wikimedia.org/T392784)
[11:17:15] <wikibugs>	 (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[11:22:55] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139038 (owner: Jforrester)
[11:22:57] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139038 (owner: Jforrester)
[11:23:12] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139039 (https://phabricator.wikimedia.org/T376827) (owner: Jforrester)
[11:23:25] <wikibugs>	 (CR) Kamila Součková: [C:+1] mediawiki::maintenance: migrate main startupregistrystats job to k8s [puppet] - https://gerrit.wikimedia.org/r/1139020 (https://phabricator.wikimedia.org/T388540) (owner: Hnowlan)
[11:23:25] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139047 (https://phabricator.wikimedia.org/T390384) (owner: Jforrester)
[11:25:19] <Amir1>	 jouncebot: nowandnext
[11:25:19] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 34 minute(s)
[11:25:19] <jouncebot>	 In 1 hour(s) and 34 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1300)
[11:25:33] <wikibugs>	 (CR) Ladsgroup: [C:+2] EventStore: Add caching for per-page event lookups [extensions/CampaignEvents] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139439 (https://phabricator.wikimedia.org/T392784) (owner: Ladsgroup)
[11:26:06] <wikibugs>	 (PS1) Majavah: P:toolforge: disable_tool: Don't log diffs with secrets [puppet] - https://gerrit.wikimedia.org/r/1139443
[11:27:07] <wikibugs>	 (Merged) jenkins-bot: EventStore: Add caching for per-page event lookups [extensions/CampaignEvents] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139439 (https://phabricator.wikimedia.org/T392784) (owner: Ladsgroup)
[11:28:42] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1139439|EventStore: Add caching for per-page event lookups (T392784)]]
[11:28:47] <stashbot>	 T392784: CampaignEvents makes an uncached x1 DB query on pageviews - https://phabricator.wikimedia.org/T392784
[11:28:48] <wikibugs>	 SRE, SRE-swift-storage, Ceph, collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10772138 (Jelto) I granted `gitlab-ro` read-only access to the GitLab object storage buckets `gitlab-packages` and `gitlab-artifa...
[11:29:19] <wikibugs>	 (CR) Kamila Součková: "migration_title bad copypasta, LGTM otherwise" [puppet] - https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) (owner: Hnowlan)
[11:30:08] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 267372
[11:30:09] <logmsgbot>	 !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[11:30:21] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 267372
[11:30:37] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 264195
[11:30:55] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 264195
[11:30:56] <wikibugs>	 (CR) Kamila Součková: [C:+1] mw::maintenance: migrate deleteExpiredUserImpactData to k8s [puppet] - https://gerrit.wikimedia.org/r/1136770 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[11:31:10] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 61622
[11:31:25] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61622
[11:31:28] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 264544
[11:31:44] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1138443 (https://phabricator.wikimedia.org/T392026) (owner: Jforrester)
[11:31:52] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 264544
[11:31:57] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 17072
[11:32:17] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 17072
[11:32:23] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 270589
[11:32:25] <wikibugs>	 (PS4) Hnowlan: mediawiki: migrate all unsubscribeinactiveusers jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539)
[11:32:35] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 270589
[11:32:39] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 274607
[11:33:20] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1139439|EventStore: Add caching for per-page event lookups (T392784)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:33:40] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 274607
[11:33:42] <wikibugs>	 (CR) Kamila Součková: [C:+1] mw::maintenance: migrate mediamoderation-hourlyScan to k8s [puppet] - https://gerrit.wikimedia.org/r/1139415 (https://phabricator.wikimedia.org/T385799) (owner: Hnowlan)
[11:34:52] <wikibugs>	 (CR) Hashar: [C:+1] "Oh perfect thank you very much! :)" [puppet] - https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) (owner: Hashar)
[11:35:24] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[11:38:21] <wikibugs>	 (PS5) Hnowlan: mediawiki: migrate all unsubscribeinactiveusers jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539)
[11:40:57] <wikibugs>	 (CR) Kamila Součková: [C:+1] mw::maintenance: migrate purge_parsercache_pc1 to k8s [puppet] - https://gerrit.wikimedia.org/r/1139422 (https://phabricator.wikimedia.org/T385800) (owner: Hnowlan)
[11:41:23] <wikibugs>	 (CR) Ayounsi: [C:+2] Fastnetmon: permanently disable graphite [puppet] - https://gerrit.wikimedia.org/r/1139411 (https://phabricator.wikimedia.org/T228380) (owner: Ayounsi)
[11:41:26] <wikibugs>	 (PS1) Hashar: Jenkins job validation (DO NOT SUBMIT) [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1139445
[11:41:57] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1139439|EventStore: Add caching for per-page event lookups (T392784)]] (duration: 13m 15s)
[11:42:02] <stashbot>	 T392784: CampaignEvents makes an uncached x1 DB query on pageviews - https://phabricator.wikimedia.org/T392784
[11:42:32] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp1004.wikimedia.org
[11:43:45] <wikibugs>	 (CR) Hnowlan: [C:+2] mw::maintenance: migrate deleteExpiredUserImpactData to k8s [puppet] - https://gerrit.wikimedia.org/r/1136770 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[11:43:57] <wikibugs>	 (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1139445 (owner: Hashar)
[11:44:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[11:45:30] <wikibugs>	 (CR) Muehlenhoff: [C:+2] kernel_report: Fix typo [puppet] - https://gerrit.wikimedia.org/r/1139429 (owner: Muehlenhoff)
[11:45:57] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit2003.wikimedia.org with reason: T392804
[11:47:34] <XioNoX>	 !log push pfw policies - T392617
[11:47:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[11:51:40] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/460c1e7e3fee2d2e7ca4826011b5e66a4a6e79366c44ff434ebfa90fdadea433/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[11:52:20] <moritzm>	 !log installing avahi security updates
[11:52:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:49] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[11:52:55] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[11:53:07] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp1004.wikimedia.org
[11:53:42] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:53:51] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:56:54] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp1004.wikimedia.org
[11:57:49] <wikibugs>	 (CR) Hnowlan: mediawiki: migrate all unsubscribeinactiveusers jobs to k8s (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) (owner: Hnowlan)
[11:58:30] <wikibugs>	 SRE, cloud-services-team, Horizon, serviceops, Striker: Move cloudweb to Ganeti VMs and repurpose the servers as wikikube nodes - https://phabricator.wikimedia.org/T392478#10772235 (taavi) /cc @Andrew   Main thing to note here is that Horizon needs to be able to talk to cloud-realm services....
[11:58:55] <wikibugs>	 (CR) Hnowlan: [C:+2] mediawiki::maintenance: migrate main startupregistrystats job to k8s [puppet] - https://gerrit.wikimedia.org/r/1139020 (https://phabricator.wikimedia.org/T388540) (owner: Hnowlan)
[11:59:11] <wikibugs>	 (PS1) Slyngshede: IDM/IDP: Patch management [dns] - https://gerrit.wikimedia.org/r/1139446
[12:01:20] <wikibugs>	 (CR) Slyngshede: [C:+2] IDM/IDP: Patch management [dns] - https://gerrit.wikimedia.org/r/1139446 (owner: Slyngshede)
[12:01:26] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[12:03:18] <wikibugs>	 (CR) Kamila Součková: [C:+1] mw::maintenance: move remaining pagetriage jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139424 (https://phabricator.wikimedia.org/T388536) (owner: Hnowlan)
[12:04:01] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[12:04:21] <wikibugs>	 (CR) Kamila Součková: [C:+1] mw::maintenance::campaignevents: migrate remaining updateutcts jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139428 (https://phabricator.wikimedia.org/T385867) (owner: Hnowlan)
[12:07:09] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp2004.wikimedia.org
[12:08:05] <wikibugs>	 (CR) Kamila Součková: [C:+1] mw:maintenance:updatequerypages: move all deadendpages jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139432 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[12:08:45] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device pfw1a-codfw
[12:09:42] <wikibugs>	 (CR) Kamila Součková: [C:+1] mw::maintenance: migrate a single updatequerypages_ancientpages shard [puppet] - https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[12:10:54] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[12:11:00] <wikibugs>	 (CR) Kamila Součková: [C:+1] mediawiki: migrate all unsubscribeinactiveusers jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) (owner: Hnowlan)
[12:11:00] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[12:11:08] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp2004.wikimedia.org
[12:11:09] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device pfw1a-codfw
[12:11:24] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idm2001.wikimedia.org
[12:11:40] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[12:13:25] <moritzm>	 !log installing sqlparse security updates
[12:13:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:23] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on etherpad2002.codfw.wmnet with reason: T392804
[12:15:20] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm2001.wikimedia.org
[12:15:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet
[12:19:21] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lists2001.wikimedia.org with reason: T392804
[12:19:39] <wikibugs>	 (PS1) Majavah: P:wmcs: toolsdb_replica_cnf: Use consistent indentation [puppet] - https://gerrit.wikimedia.org/r/1139450
[12:19:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet
[12:20:22] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure: Exclude legacy facts by default - https://phabricator.wikimedia.org/T372666#10772298 (fgiunchedi) I was curious too how trixie + puppet 8 would look like and did some work in that direction, you can find the patches at `sandbox/filippo/pontoon-t...
[12:21:17] <dcausse>	 !log repooling wdqs1013
[12:21:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:17] <wikibugs>	 (CR) Ladsgroup: [C:+1] "We always forget this 😞" [puppet] - https://gerrit.wikimedia.org/r/1139350 (https://phabricator.wikimedia.org/T390530) (owner: Marostegui)
[12:23:14] <wikibugs>	 (CR) Marostegui: "I didn't forget it, but I prefer to do it in different CR" [puppet] - https://gerrit.wikimedia.org/r/1139350 (https://phabricator.wikimedia.org/T390530) (owner: Marostegui)
[12:24:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet
[12:24:06] <wikibugs>	 (PS1) Majavah: P:toolforge: prometheus: Remove duplication in relabel configs [puppet] - https://gerrit.wikimedia.org/r/1139454
[12:24:06] <wikibugs>	 (PS1) Majavah: P:toolforge: prometheus: Use DNS names to look up scrape targets [puppet] - https://gerrit.wikimedia.org/r/1139455 (https://phabricator.wikimedia.org/T392570)
[12:24:40] <wikibugs>	 (CR) Ladsgroup: [C:+1] "I always do forget it 😄" [puppet] - https://gerrit.wikimedia.org/r/1139350 (https://phabricator.wikimedia.org/T390530) (owner: Marostegui)
[12:25:05] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5369/co" [puppet] - https://gerrit.wikimedia.org/r/1139455 (https://phabricator.wikimedia.org/T392570) (owner: Majavah)
[12:28:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet
[12:32:30] <wikibugs>	 (PS1) Ozge: feat: updates blubber yaml for  articlequality_v2 [deployment-charts] - https://gerrit.wikimedia.org/r/1139460
[12:33:51] <wikibugs>	 (PS2) Ozge: feat: updates blubber yaml for  articlequality_v2 [deployment-charts] - https://gerrit.wikimedia.org/r/1139460 (https://phabricator.wikimedia.org/T391679)
[12:34:52] <wikibugs>	 (CR) Ozge: [V:+2 C:+2] feat: updates blubber yaml for  articlequality_v2 [deployment-charts] - https://gerrit.wikimedia.org/r/1139460 (https://phabricator.wikimedia.org/T391679) (owner: Ozge)
[12:37:12] <logmsgbot>	 !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:37:35] <wikibugs>	 (CR) Ladsgroup: [C:+1] "Thanks! I can keep an eye on it." [puppet] - https://gerrit.wikimedia.org/r/1139422 (https://phabricator.wikimedia.org/T385800) (owner: Hnowlan)
[12:38:32] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] statistics: add statsd_exporter [puppet] - https://gerrit.wikimedia.org/r/1139310 (https://phabricator.wikimedia.org/T392599) (owner: Filippo Giunchedi)
[12:38:42] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:38:50] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] "Prometheus will pick up metrics by itself, no need for "job" anymore" [puppet] - https://gerrit.wikimedia.org/r/1139310 (https://phabricator.wikimedia.org/T392599) (owner: Filippo Giunchedi)
[12:41:00] <wikibugs>	 (CR) David Caro: [C:+1] P:wmcs: toolsdb_replica_cnf: Use consistent indentation [puppet] - https://gerrit.wikimedia.org/r/1139450 (owner: Majavah)
[12:41:16] <wikibugs>	 (CR) Majavah: [C:+2] P:wmcs: toolsdb_replica_cnf: Use consistent indentation [puppet] - https://gerrit.wikimedia.org/r/1139450 (owner: Majavah)
[12:41:36] <wikibugs>	 SRE, Infrastructure-Foundations: Use a forward port of Puppet 7 on Trixie hosts - https://phabricator.wikimedia.org/T392790#10772373 (MoritzMuehlenhoff) In addition to the puppet-agent forward port two more packages need to be built: - puppet agent 7 needs ruby-concurrent 1.1.x (since 1.2.x has breaking...
[12:42:30] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10772374 (MoritzMuehlenhoff)
[12:43:41] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] statistics::wmde: Configure statsd_exporter [puppet] - https://gerrit.wikimedia.org/r/1139431 (https://phabricator.wikimedia.org/T389344) (owner: Lucas Werkmeister (WMDE))
[12:43:42] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:43:44] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp5026 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[12:43:47] <moritzm>	 !log installing werkzeug security updates
[12:43:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:40] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp5029 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[12:44:40] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp5028 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[12:44:41] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp5032 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[12:44:43] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp5031 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[12:44:46] <sukhe>	 huh
[12:44:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2001.codfw.wmnet
[12:45:42] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp5025 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[12:45:44] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[12:45:46] <sukhe>	 yeah
[12:45:48] <sukhe>	 !incidents
[12:45:48] <sirenbot>	 6056 (UNACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[12:45:52] <sukhe>	 !ack 6056
[12:45:52] <sirenbot>	 6056 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[12:46:27] <godog>	 can I help with the incident sukhe ? expected ?
[12:46:44] <sukhe>	 godog: no, not expected. a huge spike in upload@eqsin
[12:46:52] <sukhe>	 looking as soon as superset loads for me :]
[12:46:52] <godog>	 ack, checking too
[12:48:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job varnish-upload in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:49:53] <wikibugs>	 (PS2) AOkoth: miscweb: change os-reports runtime owner [deployment-charts] - https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794)
[12:50:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[12:51:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2001.codfw.wmnet
[12:51:42] <icinga-wm>	 RECOVERY - Webrequests Varnishkafka log producer on cp5028 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[12:51:58] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1163 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:53:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job varnish-upload in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:54:46] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:57:33] <XioNoX>	 !log test `host-inbound-traffic system-services` on pfw1-codfw - T390052
[12:57:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:37] <stashbot>	 T390052: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1300).
[13:00:05] <jouncebot>	 tgr, Lucas_WMDE, and James_F: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:09] <Lucas_WMDE>	 o/
[13:00:19] <Lucas_WMDE>	 are we okay to deploy right now? cc godog sukhe 
[13:00:38] <James_F>	 (Hey.)
[13:01:00] <tgr_>	 I'll be here in half an hour, can self-deploy
[13:01:07] <godog>	 Lucas_WMDE: AFAICT yes, thanks for checking
[13:01:11] <Lucas_WMDE>	 ok thanks
[13:01:16] <Lucas_WMDE>	 I’ll start with my backports then
[13:01:20] <sukhe>	 yep
[13:01:39] <Lucas_WMDE>	 and use spiderpig again just for the heck of it
[13:02:10] <James_F>	 Ooh, fancy.
[13:02:23] <James_F>	 Lucas_WMDE: Want to sling out my backport at the same time? It's a trivial logspam fix.
[13:02:48] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139426 (https://phabricator.wikimedia.org/T359251) (owner: Lucas Werkmeister (WMDE))
[13:02:48] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139427 (https://phabricator.wikimedia.org/T359253) (owner: Lucas Werkmeister (WMDE))
[13:02:51] <James_F>	 And "no" is a reasonable response, I can do it myself after you if you want. :-)
[13:03:11] <Lucas_WMDE>	 James_F: sorry, I already started it now
[13:03:18] <Lucas_WMDE>	 and I think I’d prefer to do it separately
[13:03:19] <James_F>	 I see. No worries.
[13:03:21] <Lucas_WMDE>	 but I can at least take a look at it now ^^
[13:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[13:04:20] <icinga-wm>	 RECOVERY - Webrequests Varnishkafka log producer on cp5025 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[13:04:20] <icinga-wm>	 RECOVERY - Webrequests Varnishkafka log producer on cp5026 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[13:05:20] <icinga-wm>	 RECOVERY - Webrequests Varnishkafka log producer on cp5029 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[13:06:37] <sukhe>	 !log clearing up Icinga alerts on cp50*
[13:06:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:18] <icinga-wm>	 RECOVERY - Webrequests Varnishkafka log producer on cp5032 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[13:07:20] <icinga-wm>	 RECOVERY - Webrequests Varnishkafka log producer on cp5031 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[13:10:58] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1163 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:13:10] <wikibugs>	 (PS5) Federico Ceratto: sre.mysql.sanitize-wiki - handle multiple hosts [cookbooks] - https://gerrit.wikimedia.org/r/1139035 (https://phabricator.wikimedia.org/T366146)
[13:16:13] <wikibugs>	 (PS3) Effie Mouzeli: Apache config for arbcom_plwiki [puppet] - https://gerrit.wikimedia.org/r/1133995 (https://phabricator.wikimedia.org/T391009) (owner: Superpes15)
[13:16:15] <wikibugs>	 (CR) Ladsgroup: [C:+2] Apache config for arbcom_plwiki [puppet] - https://gerrit.wikimedia.org/r/1133995 (https://phabricator.wikimedia.org/T391009) (owner: Superpes15)
[13:16:26] <wikibugs>	 (CR) Hnowlan: [C:+1] CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1139056 (https://phabricator.wikimedia.org/T385867) (owner: Kamila Součková)
[13:16:29] <wikibugs>	 (CR) Ladsgroup: [V:+2 C:+2] Apache config for arbcom_plwiki [puppet] - https://gerrit.wikimedia.org/r/1133995 (https://phabricator.wikimedia.org/T391009) (owner: Superpes15)
[13:18:46] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:19:26] <wikibugs>	 (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) (owner: Hnowlan)
[13:19:35] <wikibugs>	 (Merged) jenkins-bot: Migrate MediaWiki.wikibase.* stats [extensions/Wikibase] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139426 (https://phabricator.wikimedia.org/T359251) (owner: Lucas Werkmeister (WMDE))
[13:19:39] <wikibugs>	 (Merged) jenkins-bot: Migrate MediaWiki.$prefix.wikibase.client.scribunto.* to statslib [extensions/Wikibase] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139427 (https://phabricator.wikimedia.org/T359253) (owner: Lucas Werkmeister (WMDE))
[13:19:55] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1139426|Migrate MediaWiki.wikibase.* stats (T359251 T359252)]], [[gerrit:1139427|Migrate MediaWiki.$prefix.wikibase.client.scribunto.* to statslib (T359253)]]
[13:20:01] <stashbot>	 T359251: [REPO][SW][GRAFMIGR] (mw.track) Migrate MediaWiki.wikibase.repo.* to statslib - https://phabricator.wikimedia.org/T359251
[13:20:02] <stashbot>	 T359252: [GRAFMIGR] Migrate MediaWiki.wikibase.view.* to statslib - https://phabricator.wikimedia.org/T359252
[13:20:02] <stashbot>	 T359253: [CLIENT][SW][GRAFMIGR] Migrate MediaWiki.$prefix.wikibase.client.scribunto.* to statslib - https://phabricator.wikimedia.org/T359253
[13:24:18] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1139426|Migrate MediaWiki.wikibase.* stats (T359251 T359252)]], [[gerrit:1139427|Migrate MediaWiki.$prefix.wikibase.client.scribunto.* to statslib (T359253)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:24:29] <Lucas_WMDE>	 I can try to test a little bit
[13:25:58] <wikibugs>	 (CR) Kamila Součková: [C:+2] CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1139056 (https://phabricator.wikimedia.org/T385867) (owner: Kamila Součková)
[13:26:56] <Lucas_WMDE>	 looks good!
[13:27:03] <Lucas_WMDE>	 I see something in https://thanos.wikimedia.org/graph?g0.expr=mediawiki_WikibaseRepo_EditEntity_attemptSave_duration_seconds_sum&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=1&g0.store_matches=%5B%5D&g0.engine=prometheus&g0.analyze=0&g0.tenant=
[13:27:07] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync
[13:27:31] <Lucas_WMDE>	 James_F: want to start CI for your backport already? or do you want to do the config changes first?
[13:29:58] <James_F>	 Lucas_WMDE: Sure.
[13:30:14] <wikibugs>	 (CR) Jforrester: [C:+2] Fix: PHP Warning: Undefined array key "request" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1138443 (https://phabricator.wikimedia.org/T392026) (owner: Jforrester)
[13:30:57] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough
[13:31:28] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:31:28] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:31:36] <sukhe>	 ^ expected, reboots in progress
[13:31:41] <sukhe>	 doubly so for the DNS ones starting soon
[13:32:40] <wikibugs>	 (PS1) Esanders: Enable VE in Project (Wikipedia/Վիքիպեդիա) namespace at hywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1139470 (https://phabricator.wikimedia.org/T359815)
[13:32:46] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox
[13:32:46] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1004.wikimedia.org
[13:32:48] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:33:28] <wikibugs>	 (CR) Hnowlan: [C:+2] mediawiki: migrate all unsubscribeinactiveusers jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) (owner: Hnowlan)
[13:33:48] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:33:48] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:33:48] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:33:48] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1139426|Migrate MediaWiki.wikibase.* stats (T359251 T359252)]], [[gerrit:1139427|Migrate MediaWiki.$prefix.wikibase.client.scribunto.* to statslib (T359253)]] (duration: 13m 52s)
[13:33:52] <logmsgbot>	 !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[13:33:54] <stashbot>	 T359251: [REPO][SW][GRAFMIGR] (mw.track) Migrate MediaWiki.wikibase.repo.* to statslib - https://phabricator.wikimedia.org/T359251
[13:33:54] <stashbot>	 T359252: [GRAFMIGR] Migrate MediaWiki.wikibase.view.* to statslib - https://phabricator.wikimedia.org/T359252
[13:33:55] <stashbot>	 T359253: [CLIENT][SW][GRAFMIGR] Migrate MediaWiki.$prefix.wikibase.client.scribunto.* to statslib - https://phabricator.wikimedia.org/T359253
[13:33:58] <logmsgbot>	 !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[13:34:07] <logmsgbot>	 !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[13:34:34] <wikibugs>	 (Merged) jenkins-bot: Fix: PHP Warning: Undefined array key "request" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1138443 (https://phabricator.wikimedia.org/T392026) (owner: Jforrester)
[13:34:56] <logmsgbot>	 !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[13:35:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.70 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:35:58] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1168 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:36:00] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:36:07] <Lucas_WMDE>	 sorry, I got distracted for a second
[13:36:12] <Lucas_WMDE>	 James_F: you’re good to go
[13:36:16] <Lucas_WMDE>	 unless you want me to do the deploy
[13:36:17] <James_F>	 Ack.
[13:36:20] <James_F>	 I'll do it.
[13:36:22] <Lucas_WMDE>	 ok
[13:36:37] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:37:12] <sukhe>	 !log sudo cumin 'A:durum' 'disable-puppet "rolling out CR 1138823"'
[13:37:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:20] <James_F>	 tgr_: Did you want me to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1136132 whilst I'm at it?
[13:37:43] <James_F>	 Oh, wait, https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1135060 isn't in prod at all yet, I presume this should wait?
[13:38:10] <wikibugs>	 (CR) Jforrester: "https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1135060 only just landed last week; does this need to wait until that is everywhere (wmf.2" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136132 (https://phabricator.wikimedia.org/T142313) (owner: Gergő Tisza)
[13:38:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139038 (owner: Jforrester)
[13:38:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139039 (https://phabricator.wikimedia.org/T376827) (owner: Jforrester)
[13:38:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139047 (https://phabricator.wikimedia.org/T390384) (owner: Jforrester)
[13:39:30] <wikibugs>	 (Merged) jenkins-bot: Move wgParserMigrationEnableParsoidDiscussionTools and wgParserMigrationEnableParsoidArticlePages to a dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/1139038 (owner: Jforrester)
[13:39:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet
[13:40:10] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:40:11] <wikibugs>	 (Merged) jenkins-bot: manage-dblist: Default all new wikis to parsoidrendered [mediawiki-config] - https://gerrit.wikimedia.org/r/1139039 (https://phabricator.wikimedia.org/T376827) (owner: Jforrester)
[13:40:15] <wikibugs>	 (Merged) jenkins-bot: nupwiki: Enable Parsoid mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1139047 (https://phabricator.wikimedia.org/T390384) (owner: Jforrester)
[13:40:52] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1138443|Fix: PHP Warning: Undefined array key "request" (T392026)]], [[gerrit:1139038|Move wgParserMigrationEnableParsoidDiscussionTools and wgParserMigrationEnableParsoidArticlePages to a dblist]], [[gerrit:1139039|manage-dblist: Default all new wikis to parsoidrendered (T376827)]], [[gerrit:1139047|nupwiki: Enable Parsoid mode (T390384)]]
[13:40:59] <stashbot>	 T392026: PHP Warning: Undefined array key "request" - https://phabricator.wikimedia.org/T392026
[13:40:59] <stashbot>	 T376827: Add a new checklist item to the Wiki creation process for Parsoid Read Views - https://phabricator.wikimedia.org/T376827
[13:41:00] <stashbot>	 T390384: Create Wikipedia Nupe - https://phabricator.wikimedia.org/T390384
[13:41:55] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs7003.magru.wmnet} and A:liberica
[13:42:17] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs7003.magru.wmnet} and A:liberica
[13:43:03] <wikibugs>	 (CR) Ssingh: [V:+1 C:+2] P:durum: add conditional to enable ECH (esams) [puppet] - https://gerrit.wikimedia.org/r/1138823 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[13:43:08] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:43:08] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:43:08] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:43:08] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:43:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet
[13:43:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet
[13:43:48] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:43:48] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:43:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet
[13:44:18] <icinga-wm>	 PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:44:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet
[13:44:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2092 to cirrussearch2092
[13:45:03] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[13:45:04] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[13:45:10] <jinxer-wm>	 RESOLVED: [10x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:45:10] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[13:45:24] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1138443|Fix: PHP Warning: Undefined array key "request" (T392026)]], [[gerrit:1139038|Move wgParserMigrationEnableParsoidDiscussionTools and wgParserMigrationEnableParsoidArticlePages to a dblist]], [[gerrit:1139039|manage-dblist: Default all new wikis to parsoidrendered (T376827)]], [[gerrit:1139047|nupwiki: Enable Parsoid mode (T390384)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:46:10] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Continuing with sync
[13:47:43] <tgr_>	 James_F: uh yeah, I didn't think that one through
[13:47:47] <wikibugs>	 (PS1) DDesouza: Design Research Participant Survey: Increase Coverage [mediawiki-config] - https://gerrit.wikimedia.org/r/1139474 (https://phabricator.wikimedia.org/T392325)
[13:47:48] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:47:48] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:47:49] <tgr_>	 I'll move it to next week
[13:47:56] <James_F>	 tgr_: No worries, I didn't merge it anyway. :-)
[13:48:00] <James_F>	 <3
[13:48:18] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136132 (https://phabricator.wikimedia.org/T142313) (owner: Gergő Tisza)
[13:48:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet
[13:48:48] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns1004.wikimedia.org
[13:48:48] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:48:48] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:48:48] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns1004.wikimedia.org
[13:49:14] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1004.wikimedia.org
[13:49:26] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns1004.wikimedia.org [reason: reboot finished]
[13:49:34] <icinga-wm>	 PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:49:44] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum3003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:49:46] <icinga-wm>	 PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:49:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2092 to cirrussearch2092 - bking@cumin2002"
[13:49:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet
[13:50:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2002.codfw.wmnet
[13:50:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet
[13:50:14] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum3003 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[13:50:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet
[13:51:30] <logmsgbot>	 !log sukhe@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3003.esams.wmnet with reason: testing ECH
[13:51:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2092 to cirrussearch2092 - bking@cumin2002"
[13:51:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:51:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2092
[13:51:59] <wikibugs>	 (CR) Hnowlan: [C:+2] mw::maintenance: move remaining pagetriage jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139424 (https://phabricator.wikimedia.org/T388536) (owner: Hnowlan)
[13:52:49] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1138443|Fix: PHP Warning: Undefined array key "request" (T392026)]], [[gerrit:1139038|Move wgParserMigrationEnableParsoidDiscussionTools and wgParserMigrationEnableParsoidArticlePages to a dblist]], [[gerrit:1139039|manage-dblist: Default all new wikis to parsoidrendered (T376827)]], [[gerrit:1139047|nupwiki: Enable Parsoid mode (T390384)]] (duration: 11m 56s)
[13:52:52] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139474 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza)
[13:52:55] <stashbot>	 T392026: PHP Warning: Undefined array key "request" - https://phabricator.wikimedia.org/T392026
[13:52:55] <stashbot>	 T376827: Add a new checklist item to the Wiki creation process for Parsoid Read Views - https://phabricator.wikimedia.org/T376827
[13:52:55] <stashbot>	 T390384: Create Wikipedia Nupe - https://phabricator.wikimedia.org/T390384
[13:53:14] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2092
[13:53:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[13:53:54] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2092 to cirrussearch2092
[13:54:24] <James_F>	 !log Deployment window complete.
[13:54:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2092.codfw.wmnet with OS bullseye
[13:54:55] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2092
[13:56:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet
[13:56:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet
[13:57:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[13:57:22] <wikibugs>	 (PS3) Jforrester: dblists: Add sul.dbexpr and generated sul.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) (owner: BryanDavis)
[13:57:23] <wikibugs>	 (PS4) Jforrester: Use `sul` dblist in InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/1137480 (owner: BryanDavis)
[13:57:49] <wikibugs>	 (CR) Jforrester: [C:+1] "PS3: Rebase and re-gen to add nupwiki." [mediawiki-config] - https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) (owner: BryanDavis)
[13:58:03] <logmsgbot>	 !log sukhe@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3004.esams.wmnet with reason: testing ECH
[13:58:16] <wikibugs>	 (CR) CI reject: [V:-1] Use `sul` dblist in InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/1137480 (owner: BryanDavis)
[13:58:23] <wikibugs>	 (CR) Jforrester: [C:+1] "PS4: Rebase over my addition of the new parsoidrendered dblist." [mediawiki-config] - https://gerrit.wikimedia.org/r/1137480 (owner: BryanDavis)
[13:58:39] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[13:58:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[13:58:46] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[13:59:06] <icinga-wm>	 PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:59:32] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): manage-dblist: Add missing spaces [mediawiki-config] - https://gerrit.wikimedia.org/r/1139478
[13:59:34] <icinga-wm>	 PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:59:39] <Lucas_WMDE>	 jouncebot: nowandnext
[13:59:40] <jouncebot>	 For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1300)
[13:59:40] <jouncebot>	 In 1 hour(s) and 30 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1530)
[13:59:43] <wikibugs>	 (CR) Hnowlan: [C:+2] mw::maintenance::campaignevents: migrate remaining updateutcts jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139428 (https://phabricator.wikimedia.org/T385867) (owner: Hnowlan)
[14:00:37] <Lucas_WMDE>	 I’ll quickly roll out that code style cleanup
[14:01:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2092 - bking@cumin2002"
[14:01:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2092 - bking@cumin2002"
[14:01:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:01:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2092.codfw.wmnet 228.16.192.10.in-addr.arpa 8.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:01:15] <wikibugs>	 (PS5) Jforrester: Use `sul` dblist in InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/1137480 (owner: BryanDavis)
[14:01:17] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2092.codfw.wmnet 228.16.192.10.in-addr.arpa 8.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:01:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2092
[14:01:49] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139478 (owner: Lucas Werkmeister (WMDE))
[14:01:58] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:02:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2092
[14:02:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2092
[14:02:41] <wikibugs>	 (Merged) jenkins-bot: manage-dblist: Add missing spaces [mediawiki-config] - https://gerrit.wikimedia.org/r/1139478 (owner: Lucas Werkmeister (WMDE))
[14:02:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1139478|manage-dblist: Add missing spaces]]
[14:02:58] <wikibugs>	 (PS1) Filippo Giunchedi: puppetdb: add tunable for maximum-pool-size [puppet] - https://gerrit.wikimedia.org/r/1139481
[14:03:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:04:14] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org
[14:06:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:06:28] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:06:28] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:06:58] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:07:14] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum3003 is OK: OK: UP (pid=1383598) and all threads (8) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[14:07:18] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1139478|manage-dblist: Add missing spaces]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:07:24] <Lucas_WMDE>	 I’ll do a very cursory test
[14:07:48] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:07:48] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:08:10] <Lucas_WMDE>	 seems to work afaict
[14:08:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync
[14:08:46] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:08:50] <icinga-wm>	 PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:08:50] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:09:34] <icinga-wm>	 RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:09:34] <icinga-wm>	 RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:09:44] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum3003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[14:09:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:09:46] <icinga-wm>	 RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:09:46] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 95, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:09:46] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[14:09:49] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[14:09:50] <icinga-wm>	 RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:09:50] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:10:08] <icinga-wm>	 RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:10:48] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:10:48] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:11:10] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:11:33] <wikibugs>	 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10772794 (10ArthurPSmith) @Silvan_WMDE Thanks for working on this! I would note that t...
[14:12:00] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:13:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[14:13:44] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org
[14:13:56] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:13:56] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:14:56] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:14:56] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:15:09] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1139478|manage-dblist: Add missing spaces]] (duration: 12m 12s)
[14:16:10] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:16:28] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on people2003.codfw.wmnet with reason: T391357
[14:16:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1051.eqiad.wmnet, wikikube-worker1144.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1280.eqiad.wmnet, wikikube-worker1298.eqiad.wmnet, wikikube-worker1306.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1155.eqiad.
[14:16:32] <icinga-wm>	 ikikube-worker1103.eqiad.wmnet, wikikube-worker1108.eqiad.wmnet, wikikube-worker1007.eqiad.wmnet, wikikube-worker1315.eqiad.wmnet, wikikube-worker1094.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1072.eqiad.wmnet, wikikube-worker1307.eqiad.wmnet, wikikube-worker1313.eqiad.wmnet, wikikube-worker1056.eqiad.wmnet, wikikube-worker1270.eqiad.wmnet, wikikube-worke
[14:16:32] <icinga-wm>	 iad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube-worker1119.eqiad.wmnet, wikikube-worker1070.eqiad.wmnet, wikikube-worker1309.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal
[14:16:32] <stashbot>	 T391357: Deploy Phabricator/Phorge 2025-04-08 - https://phabricator.wikimedia.org/T391357
[14:16:40] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1280.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1108.eqiad.wmnet, wikikube-worker1121.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, wikikube-worker1007.eqiad.wmnet, wikikube-worker1049.eqiad.
[14:16:40] <icinga-wm>	 ikikube-worker1094.eqiad.wmnet, wikikube-worker1016.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1157.eqiad.wmnet, wikikube-worker1053.eqiad.wmnet, wikikube-worker1072.eqiad.wmnet, wikikube-worker1149.eqiad.wmnet, wikikube-worker1159.eqiad.wmnet, wikikube-worker1287.eqiad.wmnet, wikikube-worker1168.eqiad.wmnet, wikikube-worker1278.eqiad.wmnet, wikikube-worke
[14:16:40] <icinga-wm>	 iad.wmnet, wikikube-worker1102.eqiad.wmnet, wikikube-worker1002.eqiad.wmnet, wikikube-worker1021.eqiad.wmnet, wikikube-worker1130.eqiad.wmnet, wikikube-worker1062.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal
[14:16:46] <sukhe>	 hello
[14:16:57] <jinxer-wm>	 FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:16:58] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:17:23] <sukhe>	 !incidents
[14:17:24] <sirenbot>	 6058 (UNACKED)  ProbeDown sre (10.2.2.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 eqiad)
[14:17:24] <sirenbot>	 6056 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[14:17:26] <sukhe>	 !ack 6058
[14:17:27] <sirenbot>	 6058 (ACKED)  ProbeDown sre (10.2.2.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 eqiad)
[14:17:50] <godog>	 thank you sukhe
[14:17:54] <wikibugs>	 07Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10772800 (10elukey) The host was reimaged on the 5th afaics:  ` 2024-05-06 09:10:50,421 marostegui 595479 [DEBUG _cookbook.py:511 in main] Executing cookbook sre.hosts.reimage with args: ['--os', 'bookworm', '-t', 'T363...
[14:18:00] <sukhe>	 godog: no worries, now on to finding out how to debug this :D
[14:18:05] <godog>	 lol indeed
[14:18:09] <hnowlan>	 o/ 
[14:18:10] * Lucas_WMDE done deploying btw
[14:18:13] <hnowlan>	 I will have a look also 
[14:18:16] <sukhe>	 hi hnowlan :)
[14:18:16] <sukhe>	 <3
[14:18:56] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:18:56] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:19:04] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2092.codfw.wmnet with reason: host reimage
[14:19:04] <Raine>	 thanks hnowlan <3 
[14:19:14] <logmsgbot>	 !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on planet2003.codfw.wmnet with reason: reboot
[14:19:45] <hnowlan>	 huh, every worker is busy 
[14:19:47] <logmsgbot>	 !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on planet1003.eqiad.wmnet with reason: reboot
[14:20:26] <Raine>	 mhm
[14:20:28] <hnowlan>	 that should be highly unlikely, but we have 4 instances of mw-videoscaler running in parallel 
[14:20:41] <hnowlan>	 temporary fix is to bump replicas and let this get cleaned up
[14:20:56] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:20:56] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:21:10] <jinxer-wm>	 FIRING: [14x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:21:20] <Raine>	 hnowlan: I can do that
[14:22:04] <wikibugs>	 (03PS1) 10Hnowlan: shellbox-video: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139484
[14:22:14] <hnowlan>	 Raine: oh sorry, was in the other tab doing ^ :D
[14:22:14] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2092.codfw.wmnet with reason: host reimage
[14:22:18] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] shellbox-video: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139484 (owner: 10Hnowlan)
[14:22:35] <Raine>	 hnowlan: oh, okay, thanks :D 
[14:22:44] <hnowlan>	 that comment above the value is looking a little silly now >_> 
[14:23:07] <icinga-wm>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:23:09] <hnowlan>	 someone must have uploaded something big 
[14:23:37] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:23:42] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:24:03] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] shellbox-video: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139484 (owner: 10Hnowlan)
[14:24:35] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:25:09] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:25:55] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox-video: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139484 (owner: 10Hnowlan)
[14:26:06] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[14:26:10] <jinxer-wm>	 RESOLVED: [10x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:26:37] <hnowlan>	 aghhh this is going to fail because it'll hit the resource limits 
[14:26:46] <Raine>	 uh
[14:27:18] <wikibugs>	 (03CR) 10Jforrester: "Oh, sorry, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139478 (owner: 10Lucas Werkmeister (WMDE))
[14:27:37] <hnowlan>	 (maybe)
[14:28:00] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "np, not your fault that phpcs isn’t running :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139478 (owner: 10Lucas Werkmeister (WMDE))
[14:28:07] <icinga-wm>	 PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:28:35] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:28:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi)
[14:28:41] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:28:44] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org
[14:29:31] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:29:37] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:29:48] <sukhe>	 thanks hnowlan!
[14:30:09] <icinga-wm>	 RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:30:29] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:30:29] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:30:33] <wikibugs>	 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10772825 (10taavi) Anything left to do here?
[14:30:35] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10772827 (10elukey) p:05Triage→03Medium
[14:30:55] <hnowlan>	 sukhe: it might come back unfortunately, I'll keep looking 
[14:31:06] <wikibugs>	 (03CR) 10CDanis: [C:03+1] Allow releng to resume train related systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) (owner: 10Hashar)
[14:31:20] <sukhe>	 hnowlan: hth if on-callers can, please let us know
[14:31:24] <wikibugs>	 (03CR) 10Muehlenhoff: "Was approved in the weekly SRE IF meeting." [puppet] - 10https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) (owner: 10Hashar)
[14:31:43] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Use a forward port of Puppet 7 on Trixie hosts - https://phabricator.wikimedia.org/T392790#10772839 (10MoritzMuehlenhoff) p:05Triage→03Medium
[14:31:52] <wikibugs>	 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10772840 (10Silvan_WMDE) >>! In T374230#10772794, @ArthurPSmith wrote: > Does the fix...
[14:31:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:32:01] <sukhe>	 nice
[14:32:01] <hnowlan>	 sukhe: thanks. The good news is that as it stands this isn't creating user-facing errors 
[14:32:01] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10772843 (10ayounsi) p:05Triage→03Medium
[14:32:31] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1051.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, wikikube-worker1306.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1103.eqiad.wmnet, wikikube-worker1108.eqiad.
[14:32:31] <icinga-wm>	 ikikube-worker1116.eqiad.wmnet, wikikube-worker1281.eqiad.wmnet, wikikube-worker1315.eqiad.wmnet, wikikube-worker1132.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1136.eqiad.wmnet, wikikube-worker1260.eqiad.wmnet, wikikube-worker1161.eqiad.wmnet, wikikube-worker1157.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1307.eqiad.wmnet, wikikube-worker1069.eqiad.wmnet, wikikube-worker1159.eqiad.wmnet, wikikube-worke
[14:32:31] <icinga-wm>	 iad.wmnet, wikikube-worker1168.eqiad.wmnet, wikikube-worker1244.eqiad.wmnet, wikikube-worker1148.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal
[14:32:36] <sukhe>	 hmm
[14:32:41] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1280.eqiad.wmnet, wikikube-worker1144.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, wikikube-worker1067.eqiad.wmnet, wikikube-worker1148.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1103.eqiad.wmnet, wikikube-worker1108.eqiad.
[14:32:41] <icinga-wm>	 ikikube-worker1101.eqiad.wmnet, wikikube-worker1121.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker1094.eqiad.wmnet, wikikube-worker1132.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1136.eqiad.wmnet, wikikube-worker1260.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1053.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, wikikube-worke
[14:32:41] <icinga-wm>	 iad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube-worker1306.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube-worker1070.eqiad.wmnet, wikikube-worker1256.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal
[14:32:45] <hnowlan>	 sigh 
[14:32:47] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:32:47] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:32:49] <icinga-wm>	 PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:32:57] <jinxer-wm>	 FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:33:01] <sukhe>	 !ack 6059
[14:33:02] <sirenbot>	 6059 (ACKED)  ProbeDown sre (10.2.2.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 eqiad)
[14:33:18] <sukhe>	 hnowlan: I guess time to track down what's causing this?
[14:33:24] <hnowlan>	 the 4 scap runs in series created 4 workers 
[14:33:30] <hnowlan>	 all of which are doing long-running transcodes
[14:33:33] <wikibugs>	 07Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10772850 (10jhathaway) Thanks @elukey, perhaps puppetserver needs to be reloaded to pick up the revoke, and this didn't happen until more recently?
[14:33:49] <icinga-wm>	 RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:33:59] <hnowlan>	 unfortunately I think the best way to stem the bleeding is to kill one, which will cause a small number of transcodes to fail, but otherwise we can't be sure about how long this will go on
[14:34:14] <wikibugs>	 06SRE, 06Traffic: Console domain and property access request - https://phabricator.wikimedia.org/T381904#10772852 (10joanna_borun)
[14:34:25] <Raine>	 hnowlan: those transcodes might even get retried automatically, no?
[14:34:41] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:34:45] <hnowlan>	 maybe, hopefully :) 
[14:35:01] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on planet2003.codfw.wmnet with reason: T391357
[14:35:05] <Raine>	 go for it then
[14:35:05] <stashbot>	 T391357: Deploy Phabricator/Phorge 2025-04-08 - https://phabricator.wikimedia.org/T391357
[14:35:19] <sukhe>	 recoveries coming in again. what do we usually do when a long running transcode is in progress like this?
[14:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:35:28] <sukhe>	 ah nvm, I see the message at :33
[14:35:31] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:35:44] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[14:35:49] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 10Sustainability (Incident Followup): sessionstorage namespacing - https://phabricator.wikimedia.org/T392170#10772856 (10Krinkle)
[14:35:49] <hnowlan>	 deleted a pod 
[14:36:16] <sukhe>	 !incidents
[14:36:17] <sirenbot>	 6059 (ACKED)  ProbeDown sre (10.2.2.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 eqiad)
[14:36:17] <sirenbot>	 6058 (RESOLVED)  ProbeDown sre (10.2.2.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 eqiad)
[14:36:17] <sirenbot>	 6056 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[14:36:18] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 10Sustainability (Incident Followup): sessionstore workload observability - https://phabricator.wikimedia.org/T392182#10772857 (10Krinkle)
[14:36:25] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.77 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:36:35] <Raine>	 thank you hnowlan <3
[14:36:40] <sukhe>	 indeed <3
[14:36:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[14:36:47] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:36:47] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:37:41] <icinga-wm>	 PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:37:49] <icinga-wm>	 PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:37:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:38:09] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti-test2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[14:38:31] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1280.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1306.eqiad.wmnet, wikikube-worker1103.eqiad.wmnet, wikikube-worker1108.eqiad.wmnet, wikikube-worker1121.eqiad.wmnet, wikikube-worker1116.eqiad.
[14:38:31] <icinga-wm>	 ikikube-worker1036.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker1320.eqiad.wmnet, wikikube-worker1260.eqiad.wmnet, wikikube-worker1268.eqiad.wmnet, wikikube-worker1094.eqiad.wmnet, wikikube-worker1132.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1157.eqiad.wmnet, wikikube-worker1053.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, wikikube-worke
[14:38:31] <icinga-wm>	 iad.wmnet, wikikube-worker1159.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, wikikube-worker1244.eqiad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube-worker1148.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal
[14:38:41] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1051.eqiad.wmnet, wikikube-worker1144.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1280.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1050.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1049.eqiad.wmnet, wikikube-worker1025.eqiad.
[14:38:41] <icinga-wm>	 ikikube-worker1315.eqiad.wmnet, wikikube-worker1016.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1136.eqiad.wmnet, wikikube-worker1157.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1072.eqiad.wmnet, wikikube-worker1149.eqiad.wmnet, wikikube-worker1313.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, wikikube-worker1244.eqiad.wmnet, wikikube-worker1037.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube-worke
[14:38:41] <icinga-wm>	 iad.wmnet, wikikube-worker1070.eqiad.wmnet, wikikube-worker1162.eqiad.wmnet, wikikube-worker1098.eqiad.wmnet, wikikube-worker1021.eqiad.wmnet, wikikube-worker1062.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal
[14:38:41] <icinga-wm>	 RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:38:49] <icinga-wm>	 RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:38:52] <sukhe>	 Monday is turning out to be fun :)
[14:39:22] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough
[14:39:43] <hnowlan>	 sorry about this, pods are restarting themselves (when they shouldn't be? unclear)
[14:40:03] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on releases2003.codfw.wmnet with reason: T391357
[14:40:07] <stashbot>	 T391357: Deploy Phabricator/Phorge 2025-04-08 - https://phabricator.wikimedia.org/T391357
[14:40:31] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:40:41] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:41:19] <hnowlan>	 okay, terminated all old videoscalers 
[14:41:25] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.77 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:41:38] <hnowlan>	 -video pods will take a little bit to clear up though as the jobs have to finish, unfortunately 
[14:41:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[14:41:51] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:41:53] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:42:42] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on stewards2001.codfw.wmnet with reason: T391357
[14:42:44] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org
[14:42:51] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:42:51] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:43:07] <godog>	 ack, thanks for the update hnowlan 
[14:43:14] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2092.codfw.wmnet with OS bullseye
[14:43:52] <wikibugs>	 (03PS2) 10Ssingh: gerrit: add google-site-verification [dns] - 10https://gerrit.wikimedia.org/r/1138996 (https://phabricator.wikimedia.org/T392669) (owner: 10Hashar)
[14:44:55] <hnowlan>	 because all replicas are busy in -video, the odds of the apply I did 10 minutes ago failing are quite high. if that happens, we might see another page, and if we do I will just manually bump replicas 
[14:45:10] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on vrts2002.codfw.wmnet with reason: T391357
[14:45:14] <stashbot>	 T391357: Deploy Phabricator/Phorge 2025-04-08 - https://phabricator.wikimedia.org/T391357
[14:45:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet
[14:45:44] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[14:45:48] <wikibugs>	 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10772886 (10ssingh) >>! In T379927#10772825, @taavi wrote: > Anything left to do here?  Nothing on the prod DNS hosts side; if you k...
[14:46:10] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] gerrit: add google-site-verification [dns] - 10https://gerrit.wikimedia.org/r/1138996 (https://phabricator.wikimedia.org/T392669) (owner: 10Hashar)
[14:46:11] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[14:46:18] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10772887 (10MoritzMuehlenhoff)
[14:46:20] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138859 (https://phabricator.wikimedia.org/T388719) (owner: 10Bernard Wang)
[14:46:22] <sukhe>	 hashar: ^
[14:46:29] <sukhe>	 deploying https://gerrit.wikimedia.org/r/c/operations/dns/+/1138996
[14:46:33] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:47:43] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[14:47:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet
[14:47:55] <sukhe>	 !log reprepro -C component/nginx-ech include bookworm-wikimedia nginx_1.22.1-9+deb12u1+ech4_amd64.changes: T205378
[14:47:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:00] <stashbot>	 T205378: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378
[14:48:23] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1189 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:49:01] <Raine>	 thanks some more hnowlan <3
[14:49:02] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[14:50:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2004.codfw.wmnet
[14:51:23] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:51:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:54:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2004.codfw.wmnet
[14:54:57] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): manage-dblist: Fix indentation and stray blank line [mediawiki-config] - https://gerrit.wikimedia.org/r/1139487 (https://phabricator.wikimedia.org/T392819)
[14:54:58] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): manage-dblist: Fix some random phpcs violations [mediawiki-config] - https://gerrit.wikimedia.org/r/1139488 (https://phabricator.wikimedia.org/T392819)
[14:55:01] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): manage-dblist: Rename to manage-dblist.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1139489 (https://phabricator.wikimedia.org/T392819)
[14:55:37] <sukhe>	 !log re-enable puppet and force agent run on A:durum
[14:55:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:53] <icinga-wm>	 PROBLEM - Disk space on mwmaint1002 is CRITICAL: DISK CRITICAL - free space: / 3753 MB (3% inode=92%): /tmp 3753 MB (3% inode=92%): /var/tmp 3753 MB (3% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops
[14:57:09] <wikibugs>	 (CR) Joal: [C:+1] "+1, it doesn't really change anything on our end :)" [puppet] - https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) (owner: Fabfur)
[14:57:31] <Amir1>	 !log CREATE INDEX cxs_source_language_title ON cx_suggestions (cxs_source_language, cxs_title); on wikishared (T390510)
[14:57:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:42] <stashbot>	 T390510: Fatal DBUnexpectedError: "Database servers in extension1 are overloaded" - https://phabricator.wikimedia.org/T390510
[14:57:44] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org
[14:58:52] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti-test2001.codfw.wmnet
[14:58:52] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet
[14:59:04] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:01:50] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:01:50] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[15:04:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:04:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2005.codfw.wmnet
[15:05:24] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1189 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:05:50] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:05:50] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:07:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:48] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[15:08:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2005.codfw.wmnet
[15:10:22] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[15:11:33] <Amir1>	 !log CREATE INDEX translation_started_by_last_updated_timestamp ON cx_translations (translation_started_by, translation_last_updated_timestamp); (T390510)
[15:11:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:38] <stashbot>	 T390510: Fatal DBUnexpectedError: "Database servers in extension1 are overloaded" - https://phabricator.wikimedia.org/T390510
[15:13:08] <wikibugs>	 (CR) LorenMora: [C:+1] Remove Search AB test config [mediawiki-config] - https://gerrit.wikimedia.org/r/1138859 (https://phabricator.wikimedia.org/T388719) (owner: Bernard Wang)
[15:13:56] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1194 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:03] <wikibugs>	 ops-magru, DC-Ops, Observability-Metrics, Patch-For-Review, SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10773031 (tappof) @wiki_willy, I was able to split the PDUs in a 'per row' manner. If you're looking at a PoP, this is equiva...
[15:14:05] <logmsgbot>	 sukhe@cumin1002 roll-reboot (PID 3625946) is awaiting input
[15:14:21] <sukhe>	 er
[15:14:25] <wikibugs>	 (CR) JHathaway: [C:+1] sre.hosts.move-vlan: improve grep reports when renaming [cookbooks] - https://gerrit.wikimedia.org/r/1139407 (https://phabricator.wikimedia.org/T392729) (owner: Elukey)
[15:15:14] <wikibugs>	 (CR) JHathaway: [C:+2] puppetserver: update sync-puppet-ca timer [puppet] - https://gerrit.wikimedia.org/r/1139120 (https://phabricator.wikimedia.org/T392628) (owner: JHathaway)
[15:15:14] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns2004.wikimedia.org [reason: reboot finished]
[15:15:25] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns2004.wikimedia.org
[15:15:25] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns2004.wikimedia.org
[15:15:54] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org
[15:17:19] <wikibugs>	 (CR) Elukey: [C:+2] sre.hosts.move-vlan: improve grep reports when renaming [cookbooks] - https://gerrit.wikimedia.org/r/1139407 (https://phabricator.wikimedia.org/T392729) (owner: Elukey)
[15:17:56] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[15:19:56] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1194 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:24:34] <wikibugs>	 (CR) JHathaway: "@mmuhlenhoff@wikimedia.org per our IRL discussion the other piece of timer validation is here, https://gerrit.wikimedia.org/r/plugins/giti" [puppet] - https://gerrit.wikimedia.org/r/1138905 (https://phabricator.wikimedia.org/T392629) (owner: JHathaway)
[15:25:44] <wikibugs>	 (PS1) Ayounsi: Fastnetmon: bump threshold_pps to 1.75M [puppet] - https://gerrit.wikimedia.org/r/1139503
[15:25:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:25:57] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2093 to cirrussearch2093
[15:26:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[15:26:42] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10773102 (MoritzMuehlenhoff)
[15:28:06] <icinga-wm>	 RECOVERY - Check unit status of backup-kdc-database on krb1002 is OK: OK: Status of the systemd unit backup-kdc-database https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:28:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[15:30:05] <jouncebot>	 jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1530).
[15:30:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2093 to cirrussearch2093 - bking@cumin2002"
[15:30:43] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2093 to cirrussearch2093 - bking@cumin2002"
[15:30:43] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:30:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2093
[15:30:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:30:54] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2005.wikimedia.org
[15:31:03] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@stat1011:~$ sudo -u analytics-wmde git -C /srv/analytics-wmde/graphite/src/scripts/ pull # T389344, I don’t want to wait until the next Puppet run in 26 minutes
[15:31:03] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2093
[15:31:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:07] <stashbot>	 T389344: analytics/wmde/scripts Graphite to Prometheus migration - https://phabricator.wikimedia.org/T389344
[15:31:20] <wikibugs>	 Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10773131 (elukey) One thing that I see is that the reimage failed:  ` 2024-05-06 10:31:27,368 marostegui 595479 [INFO _log.py:125 in log_task_end] END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1178...
[15:31:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2093 to cirrussearch2093
[15:31:46] <icinga-wm>	 RECOVERY - Disk space on arclamp1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp1001&var-datasource=eqiad+prometheus/ops
[15:32:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2093.codfw.wmnet on all recursors
[15:32:22] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) cirrussearch2093.codfw.wmnet on all recursors
[15:32:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2093.codfw.wmnet with OS bullseye
[15:33:11] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2093
[15:34:06] <icinga-wm>	 RECOVERY - Disk space on arclamp2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp2001&var-datasource=codfw+prometheus/ops
[15:34:34] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[15:34:50] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:34:50] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:36:26] <zip>	 I'd like to use `deleteBatch.php` to delete a set of broken Flow boards on gomwiki... any issues with doing so?
[15:36:50] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:36:50] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:37:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:38:41] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2093 - bking@cumin2002"
[15:38:47] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2093 - bking@cumin2002"
[15:38:48] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:38:48] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2093.codfw.wmnet 229.16.192.10.in-addr.arpa 9.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:38:51] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2093.codfw.wmnet 229.16.192.10.in-addr.arpa 9.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:38:52] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2093
[15:38:53] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[15:39:02] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2093
[15:39:02] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2093
[15:39:26] <moritzm>	 !log installing edk2 security updates
[15:39:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:42:45] <zip>	 right then, going ahead
[15:42:53] <logmsgbot>	 !log zoe@deploy1003 manually-logged T389247 Beginning deletion of broken gomwiki flow boards
[15:42:57] <stashbot>	 T389247: Run Flow migration script at *gomwiki* - https://phabricator.wikimedia.org/T389247
[15:43:39] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) (owner: BryanDavis)
[15:43:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[15:45:29] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns2005.wikimedia.org
[15:45:36] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[15:45:48] <logmsgbot>	 !log sukhe@cumin1002 END (ERROR) - Cookbook sre.dns.roll-reboot (exit_code=97) rolling reboot on A:dnsbox
[15:46:12] <sukhe>	 !log pause execution of sre.dns.roll-reboot to figure out skipping of Icinga service warning
[15:46:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:24] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns2005.wikimedia.org
[15:46:25] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns2005.wikimedia.org
[15:46:33] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns2005.wikimedia.org [reason: reboot finished]
[15:47:23] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@stat1011:~$ sudo -u analytics-wmde git -C /srv/analytics-wmde/graphite/src/scripts/ pull --ff-only # T389344
[15:47:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:28] <stashbot>	 T389344: analytics/wmde/scripts Graphite to Prometheus migration - https://phabricator.wikimedia.org/T389344
[15:48:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[15:48:49] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[15:49:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7003.magru.wmnet
[15:49:36] <wikibugs>	 (CR) Nik Gkountas: Catalog ContentTranslation tables (9 comments) [puppet] - https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: Nik Gkountas)
[15:49:44] <wikibugs>	 (PS2) Nik Gkountas: Catalog ContentTranslation tables [puppet] - https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094)
[15:51:22] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:51:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:51:41] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops: eno1 on gitlab-runner1003:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T392585#10773259 (Jelto) a:Jelto
[15:51:48] <wikibugs>	 (CR) CI reject: [V:-1] Catalog ContentTranslation tables [puppet] - https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: Nik Gkountas)
[15:52:11] <logmsgbot>	 !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[15:52:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7003.magru.wmnet
[15:53:42] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:53:46] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:55:20] <logmsgbot>	 !log zoe@deploy1003 manually-logged T389247 Completed deletion of broken gomwiki flow boards
[15:55:24] <stashbot>	 T389247: Run Flow migration script at *gomwiki* - https://phabricator.wikimedia.org/T389247
[15:56:19] <wikibugs>	 (CR) Fabfur: [C:+2] cache: use fqdn in haproxykafka hostname [puppet] - https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) (owner: Fabfur)
[15:56:26] <logmsgbot>	 bking@cumin2002 reimage (PID 1020203) is awaiting input
[15:57:13] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops: eno1 on gitlab-runner1003:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T392585#10773280 (LSobanski) p:Triage→Medium
[15:57:16] <wikibugs>	 (PS3) Hnowlan: trafficserver: route all but zhwiki PCS routes via the rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1138692 (https://phabricator.wikimedia.org/T390724)
[15:57:43] <logmsgbot>	 !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[15:58:25] <wikibugs>	 SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834 (Urbanecm_WMF) NEW
[15:59:01] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[15:59:19] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[15:59:25] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T392806)', diff saved to https://phabricator.wikimedia.org/P75520 and previous config saved to /var/cache/conftool/dbconfig/20250428-155924-fceratto.json
[15:59:36] <wikibugs>	 SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773310 (Urbanecm_WMF) Feels like something's filling things up. I removed some files I no longer need in my home, which got it at 99% and 1.9G space available. At this point, less than 900M is available (so about a GB worth o...
[15:59:36] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2093.codfw.wmnet with OS bullseye
[15:59:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7003.magru.wmnet
[16:00:01] <wikibugs>	 SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773313 (Urbanecm_WMF) p:Triage→Unbreak! Provisionally, server fully out of space doesn't seem like a good idea. Feel free to lower if you think that's appropriate.
[16:00:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7003.magru.wmnet
[16:03:34] <logmsgbot>	 !log zoe@deploy1003 manually-logged T389247 attempting migration
[16:03:38] <stashbot>	 T389247: Run Flow migration script at *gomwiki* - https://phabricator.wikimedia.org/T389247
[16:03:42] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service ganeti7003:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:04:49] <wikibugs>	 ops-magru, DC-Ops, Observability-Metrics, Patch-For-Review, SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10773340 (wiki_willy) Thanks @tappof, that looks perfect.  Thanks for splitting it up by rack!  I went through and checked th...
[16:07:35] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T392806)', diff saved to https://phabricator.wikimedia.org/P75523 and previous config saved to /var/cache/conftool/dbconfig/20250428-160734-fceratto.json
[16:07:36] <wikibugs>	 SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773343 (Urbanecm_WMF) And we're at zero availability:  ` [urbanecm@mwmaint1002 ~]$ df -h Filesystem                        Size  Used Avail Use% Mounted on [...] /dev/mapper/mwmaint1002--vg-root  121G  116G     0 100% / [...]...
[16:09:04] <wikibugs>	 (PS1) Ssingh: P:auth: temporarily skip returning a WARN on check_authdns_state [puppet] - https://gerrit.wikimedia.org/r/1139510
[16:10:13] <wikibugs>	 (CR) Ebernhardson: [C:-1] "The .deb to be built is at https://gitlab.wikimedia.org/repos/search-platform/opensearch-madvise/" [puppet] - https://gerrit.wikimedia.org/r/1132692 (https://phabricator.wikimedia.org/T390592) (owner: Ebernhardson)
[16:11:04] <wikibugs>	 SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773350 (elukey) ` elukey@mwmaint1002:/home$ sudo du -hs /home/* | sort -h | tail 553M /home/ebernhardson 842M /home/catrope 1.2G /home/brion 1.3G /home/tstarling 1.7G /home/oblivian 1.7G /home/samtar 2.1G /home/cparle 11G /ho...
[16:11:18] <wikibugs>	 (CR) Ssingh: [C:+2] P:auth: temporarily skip returning a WARN on check_authdns_state [puppet] - https://gerrit.wikimedia.org/r/1139510 (owner: Ssingh)
[16:11:37] <wikibugs>	 (CR) Ssingh: [C:+2] "self-merging since this is a trivial Icinga check change and will be reverted." [puppet] - https://gerrit.wikimedia.org/r/1139510 (owner: Ssingh)
[16:11:44] <wikibugs>	 SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773353 (dancy) Big directories are:  `/var/log`: 42GB  and   ` 22.8 GiB [##########] /home/zabe 14.9 GiB [######    ] /home/ladsgroup 10.9 GiB [####      ] /home/legoktm `
[16:12:30] <wikibugs>	 SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773355 (elukey) And also:  ` elukey@mwmaint1002:/var/log/mediawiki$ sudo du -hs * | sort -h | tail 505M mediawiki_job_mediamoderation-hourlyScan 519M mediawiki_job_purge_checkuser 546M mediawiki_job_cirrus_build_completion_in...
[16:13:56] <A_smart_kitten>	 hey; please could i get a second opinion on / please could someone check if they can reproduce T392832 on their device? i'm increasingly feeling like it might be severe enough to be a train blocker for the upcoming train, and if it is i want to get it flagged to the right people sooner rather than later :)
[16:13:56] <stashbot>	 T392832: Unable to access the revision-deletion interface from Special:Log - an "Invalid target revision" error page is displayed - https://phabricator.wikimedia.org/T392832
[16:14:01] <A_smart_kitten>	 (asking in -operations rather than in #mediawiki or anywhere else because of my worry that this might be a train blocker)
[16:14:04] <sukhe>	 !log force agent run on A:dnsbox to merge CR 1139510
[16:14:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:32] <wikibugs>	 (Abandoned) Hashar: [WIP] Stub LimeSurvey configuration [puppet] - https://gerrit.wikimedia.org/r/213579 (https://phabricator.wikimedia.org/T94807) (owner: Nemo bis)
[16:15:36] <wikibugs>	 SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773366 (Ladsgroup) I think something is broken with log rotation. When I was checking logs for systemd timer logs, I found stuff from years ago.
[16:16:19] <taavi>	 A_smart_kitten: yeah. I can repro on deployment-prep
[16:16:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:17:03] <A_smart_kitten>	 taavi: thanks for the check!
[16:17:52] <wikibugs>	 SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773379 (Ladsgroup) I deleted my old backup logs. That saves up 14GB but logs needs to be cleaned up.
[16:20:19] <wikibugs>	 (PS1) Majavah: P:wmcs::metricsinfra: Add instance FQDN template [puppet] - https://gerrit.wikimedia.org/r/1139511 (https://phabricator.wikimedia.org/T392570)
[16:20:43] <wikibugs>	 (CR) CI reject: [V:-1] P:wmcs::metricsinfra: Add instance FQDN template [puppet] - https://gerrit.wikimedia.org/r/1139511 (https://phabricator.wikimedia.org/T392570) (owner: Majavah)
[16:21:04] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5370/co" [puppet] - https://gerrit.wikimedia.org/r/1139511 (https://phabricator.wikimedia.org/T392570) (owner: Majavah)
[16:21:09] <wikibugs>	 (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1139455 (https://phabricator.wikimedia.org/T392570) (owner: Majavah)
[16:21:24] <wikibugs>	 (CR) David Caro: [C:+1] P:toolforge: prometheus: Use DNS names to look up scrape targets (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1139455 (https://phabricator.wikimedia.org/T392570) (owner: Majavah)
[16:21:33] <wikibugs>	 (03CR) 10David Caro: [C:03+1] P:toolforge: prometheus: Use DNS names to look up scrape targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139455 (https://phabricator.wikimedia.org/T392570) (owner: 10Majavah)
[16:21:54] <wikibugs>	 (03PS2) 10Majavah: P:wmcs::metricsinfra: Add instance FQDN template [puppet] - 10https://gerrit.wikimedia.org/r/1139511 (https://phabricator.wikimedia.org/T392570)
[16:22:13] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:toolforge: prometheus: Remove duplication in relabel configs [puppet] - 10https://gerrit.wikimedia.org/r/1139454 (owner: 10Majavah)
[16:22:20] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge: prometheus: Use DNS names to look up scrape targets [puppet] - 10https://gerrit.wikimedia.org/r/1139455 (https://phabricator.wikimedia.org/T392570) (owner: 10Majavah)
[16:22:25] <wikibugs>	 (03Abandoned) 10Hashar: Tools: Puppetize gridengine global configuration [puppet] - 10https://gerrit.wikimedia.org/r/230477 (https://phabricator.wikimedia.org/T95747) (owner: 10Tim Landscheidt)
[16:22:28] <wikibugs>	 (03Abandoned) 10Hashar: sge: Fix global config handling [puppet] - 10https://gerrit.wikimedia.org/r/351379 (https://phabricator.wikimedia.org/T162955) (owner: 10Madhuvishy)
[16:22:34] <wikibugs>	 (03Abandoned) 10Hashar: gridengine: Cleanup mergeconf script and references [puppet] - 10https://gerrit.wikimedia.org/r/352281 (https://phabricator.wikimedia.org/T162955) (owner: 10Madhuvishy)
[16:22:38] <wikibugs>	 (03Abandoned) 10Hashar: gridengine: Cleanup old scripts, tracker and collector [puppet] - 10https://gerrit.wikimedia.org/r/352294 (https://phabricator.wikimedia.org/T162955) (owner: 10Madhuvishy)
[16:22:41] <wikibugs>	 (03Abandoned) 10Hashar: gridengine: Follow up - delete old maintenance scripts and tracker/collector puppet code [puppet] - 10https://gerrit.wikimedia.org/r/352301 (https://phabricator.wikimedia.org/T162955) (owner: 10Madhuvishy)
[16:22:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P75525 and previous config saved to /var/cache/conftool/dbconfig/20250428-162242-fceratto.json
[16:22:44] <wikibugs>	 (PS2) Majavah: P:toolforge: prometheus: Use DNS names to look up scrape targets [puppet] - https://gerrit.wikimedia.org/r/1139455 (https://phabricator.wikimedia.org/T392570)
[16:22:45] <wikibugs>	 (Abandoned) Hashar: sge: Revamp queue,rqs configuration puppet [puppet] - https://gerrit.wikimedia.org/r/352895 (owner: Madhuvishy)
[16:24:54] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs7003.magru.wmnet
[16:25:10] <wikibugs>	 (CR) Majavah: [V:+2 C:+2] P:toolforge: prometheus: Use DNS names to look up scrape targets [puppet] - https://gerrit.wikimedia.org/r/1139455 (https://phabricator.wikimedia.org/T392570) (owner: Majavah)
[16:25:54] <wikibugs>	 (PS3) Majavah: P:wmcs::metricsinfra: Add instance FQDN template [puppet] - https://gerrit.wikimedia.org/r/1139511 (https://phabricator.wikimedia.org/T392570)
[16:26:32] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and not (A:eqiad or A:codfw) and A:dnsbox
[16:26:33] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org
[16:27:31] <wikibugs>	 SRE: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773450 (Tgr) The GrowthExperiments logs seem properly rotated, there are daily logfiles going back two weeks, and the log entry dates match the file date. It just seems to be creating a huge amount of logs.
[16:27:54] <wikibugs>	 (PS3) Nik Gkountas: Catalog ContentTranslation tables [puppet] - https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094)
[16:28:13] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs7003.magru.wmnet
[16:28:18] <icinga-wm>	 RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:29:48] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin pooling P{lvs7003*} and A:liberica
[16:30:19] <wikibugs>	 (PS1) Gergő Tisza: mediawiki: Make refreshLinkRecommendations job less verbose [puppet] - https://gerrit.wikimedia.org/r/1139515 (https://phabricator.wikimedia.org/T392834)
[16:30:32] <logmsgbot>	 !log sukhe@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.admin (exit_code=1) pooling P{lvs7003*} and A:liberica
[16:30:38] <icinga-wm>	 PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:30:46] <icinga-wm>	 PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:33:16] <wikibugs>	 SRE, Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773473 (Urbanecm_WMF) Hmm... I just discovered mwmaint2002's disk is significantly larger than 1002's (430G vs 120G). Should we even have servers with the same role with very different diskspace?
[16:33:19] <wikibugs>	 SRE, Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10773474 (Tgr) `listTaskCounts` uses `--output none` already, that 3G is entirely job runner boilerplate (a ton of rows like `Apr 18 15:11:00 mwmaint1002 mediawiki_job_growthexperiments-listTaskCounts[9828...
[16:34:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job pdnsrec in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:35:11] <wikibugs>	 (PS1) Zoe: Set flow boards readonly on fiwikimedia and ptwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1139517 (https://phabricator.wikimedia.org/T380909)
[16:35:40] <wikibugs>	 (PS2) Zoe: Set flow boards readonly on fiwikimedia, gomwiki and ptwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1139517 (https://phabricator.wikimedia.org/T380909)
[16:35:52] <icinga-wm>	 RECOVERY - Disk space on mwmaint1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops
[16:36:33] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139517 (https://phabricator.wikimedia.org/T380909) (owner: Zoe)
[16:37:09] <wikibugs>	 (PS1) DCausse: cirrus: re-enable completion index rebuild in codfw [puppet] - https://gerrit.wikimedia.org/r/1139518
[16:37:27] <wikibugs>	 (CR) BCornwall: "I forgot about this CR, sorry! I have since included this via I10c6d5e169972d44569b801d532d4759a6fd3e73" [puppet] - https://gerrit.wikimedia.org/r/1127551 (https://phabricator.wikimedia.org/T388809) (owner: Reedy)
[16:37:35] <wikibugs>	 (Abandoned) BCornwall: certificates.yaml: Add pywikipedia.org to non-canonical-redirect [puppet] - https://gerrit.wikimedia.org/r/1127551 (https://phabricator.wikimedia.org/T388809) (owner: Reedy)
[16:37:40] <icinga-wm>	 RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:37:46] <icinga-wm>	 RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:37:50] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P75526 and previous config saved to /var/cache/conftool/dbconfig/20250428-163749-fceratto.json
[16:39:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job pdnsrec in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:40:00] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org
[16:45:54] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:45:54] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:46:42] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure: Exclude legacy facts by default - https://phabricator.wikimedia.org/T372666#10773538 (jhathaway) great thanks @fgiunchedi!
[16:46:52] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:46:52] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:50:01] <wikibugs>	 (CR) JHathaway: "looks good, just a doc request" [puppet] - https://gerrit.wikimedia.org/r/1139481 (owner: Filippo Giunchedi)
[16:52:58] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T392806)', diff saved to https://phabricator.wikimedia.org/P75527 and previous config saved to /var/cache/conftool/dbconfig/20250428-165257-fceratto.json
[16:53:17] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[16:53:24] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T392806)', diff saved to https://phabricator.wikimedia.org/P75528 and previous config saved to /var/cache/conftool/dbconfig/20250428-165323-fceratto.json
[16:55:00] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns3004.wikimedia.org
[16:58:46] <icinga-wm>	 PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:59:08] <icinga-wm>	 PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1700)
[17:00:04] <jouncebot>	 ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T1700).
[17:02:45] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T392806)', diff saved to https://phabricator.wikimedia.org/P75529 and previous config saved to /var/cache/conftool/dbconfig/20250428-170244-fceratto.json
[17:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[17:04:46] <icinga-wm>	 RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:05:10] <icinga-wm>	 RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:11:32] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns3004.wikimedia.org
[17:13:21] <logmsgbot>	 !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on people1004.eqiad.wmnet with reason: reboot
[17:15:12] <logmsgbot>	 !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on stewards1001.eqiad.wmnet with reason: reboot
[17:17:01] <logmsgbot>	 !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on doc2003.codfw.wmnet with reason: reboot
[17:17:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P75530 and previous config saved to /var/cache/conftool/dbconfig/20250428-171752-fceratto.json
[17:18:50] <logmsgbot>	 !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on etherpad1004.eqiad.wmnet with reason: reboot
[17:23:42] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:24:04] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[17:24:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1179 (T392806)', diff saved to https://phabricator.wikimedia.org/P75531 and previous config saved to /var/cache/conftool/dbconfig/20250428-172410-ladsgroup.json
[17:24:38] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2093-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:26:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on cirrussearch2093:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:26:32] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org
[17:27:54] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:29:52] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:29:52] <icinga-wm>	 PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:29:54] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:30:22] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:30:52] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:31:11] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:32:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:32:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T392806)', diff saved to https://phabricator.wikimedia.org/P75532 and previous config saved to /var/cache/conftool/dbconfig/20250428-173250-ladsgroup.json
[17:32:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P75533 and previous config saved to /var/cache/conftool/dbconfig/20250428-173259-fceratto.json
[17:34:41] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:35:52] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:36:21] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:36:41] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:37:14] <wikibugs>	 (CR) Dzahn: "This is correct but I would like to add that" [homer/public] - https://gerrit.wikimedia.org/r/1139406 (https://phabricator.wikimedia.org/T392793) (owner: Majavah)
[17:37:50] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 95, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:37:52] <icinga-wm>	 RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:37:54] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:39:41] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:42:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:43:04] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org
[17:45:52] <icinga-wm>	 PROBLEM - Disk space on mwmaint1002 is CRITICAL: DISK CRITICAL - free space: / 3040 MB (2% inode=92%): /tmp 3040 MB (2% inode=92%): /var/tmp 3040 MB (2% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops
[17:46:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2100 to cirrussearch2100
[17:46:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[17:47:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P75534 and previous config saved to /var/cache/conftool/dbconfig/20250428-174757-ladsgroup.json
[17:48:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T392806)', diff saved to https://phabricator.wikimedia.org/P75535 and previous config saved to /var/cache/conftool/dbconfig/20250428-174806-fceratto.json
[17:48:25] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[17:48:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T392806)', diff saved to https://phabricator.wikimedia.org/P75536 and previous config saved to /var/cache/conftool/dbconfig/20250428-174831-fceratto.json
[17:48:47] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:49:36] <wikibugs>	 (PS6) BCornwall: varnish: Replace X-IS-ALT-DOMAIN with variable [puppet] - https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550)
[17:52:23] <logmsgbot>	 bking@cumin2002 rename (PID 1154737) is awaiting input
[17:54:46] <wikibugs>	 SRE-swift-storage, Commons, Thumbor: Error: 429, Too Many Requests - https://phabricator.wikimedia.org/T392348#10773808 (Yann) https://commons.wikimedia.org/wiki/File:Rembrandt_-_The_Abduction_of_Europa_-_Google_Art_Project.jpg thumbnails failed, but https://commons.wikimedia.org/wiki/File:Rembrandt_...
[17:56:58] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T392806)', diff saved to https://phabricator.wikimedia.org/P75537 and previous config saved to /var/cache/conftool/dbconfig/20250428-175657-fceratto.json
[17:58:04] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns4004.wikimedia.org
[17:59:20] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9400 on cirrussearch2093 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:59:54] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:01:52] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:01:54] <icinga-wm>	 PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:01:54] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:03:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P75538 and previous config saved to /var/cache/conftool/dbconfig/20250428-180304-ladsgroup.json
[18:03:39] <wikibugs>	 (PS1) Ssingh: P:durum: only log ECH status for ECH-enabled clients [puppet] - https://gerrit.wikimedia.org/r/1139525 (https://phabricator.wikimedia.org/T205378)
[18:03:49] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[18:04:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:04:42] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5371/co" [puppet] - https://gerrit.wikimedia.org/r/1139525 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:05:41] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:05:51] <wikibugs>	 (PS2) Ssingh: P:durum: only log ECH status for ECH-enabled clients [puppet] - https://gerrit.wikimedia.org/r/1139525 (https://phabricator.wikimedia.org/T205378)
[18:06:18] <wikibugs>	 (CR) AOkoth: miscweb: change os-reports runtime owner (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[18:06:55] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5372/co" [puppet] - https://gerrit.wikimedia.org/r/1139525 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:07:06] <wikibugs>	 (PS3) Ssingh: P:durum: only log ECH status for ECH-enabled clients [puppet] - https://gerrit.wikimedia.org/r/1139525 (https://phabricator.wikimedia.org/T205378)
[18:07:26] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2100 to cirrussearch2100 - bking@cumin2002"
[18:07:43] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2100 to cirrussearch2100 - bking@cumin2002"
[18:07:43] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:07:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2100
[18:07:54] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2100
[18:08:06] <Amir1>	 !log CREATE INDEX translation_last_update_by_last_updated_timestamp ON cx_translations (translation_last_update_by, translation_last_updated_timestamp); (T392839 and T390510)
[18:08:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:12] <stashbot>	 T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839
[18:08:12] <stashbot>	 T390510: Fatal DBUnexpectedError: "Database servers in extension1 are overloaded" - https://phabricator.wikimedia.org/T390510
[18:08:14] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1139525 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:08:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2100 to cirrussearch2100
[18:08:52] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 95, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:08:54] <icinga-wm>	 RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:08:54] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:09:16] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2100.codfw.wmnet on all recursors
[18:09:20] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2100.codfw.wmnet on all recursors
[18:09:43] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2100.codfw.wmnet with OS bullseye
[18:09:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2100
[18:10:05] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[18:10:41] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:11:47] <wikibugs>	 (PS1) Ssingh: Revert "P:auth: temporarily skip returning a WARN on check_authdns_state" [puppet] - https://gerrit.wikimedia.org/r/1139529
[18:11:56] <Amir1>	 !log CREATE INDEX cxl_owner ON cx_lists (cxl_owner); (T392839 and T390510)
[18:12:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:05] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P75539 and previous config saved to /var/cache/conftool/dbconfig/20250428-181204-fceratto.json
[18:12:46] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: T392804
[18:13:44] <jinxer-wm>	 RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[18:14:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:14:39] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: T392804
[18:14:39] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns4004.wikimedia.org
[18:14:40] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns4004.wikimedia.org
[18:14:52] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns4004.wikimedia.org [reason: reboot finished]
[18:15:13] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns4004.wikimedia.org
[18:15:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2100 - bking@cumin2002"
[18:15:26] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2100 - bking@cumin2002"
[18:15:26] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:15:26] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2100.codfw.wmnet 219.32.192.10.in-addr.arpa 9.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[18:15:30] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2100.codfw.wmnet 219.32.192.10.in-addr.arpa 9.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[18:15:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2100
[18:15:43] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[18:18:09] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[18:18:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T392806)', diff saved to https://phabricator.wikimedia.org/P75540 and previous config saved to /var/cache/conftool/dbconfig/20250428-181811-ladsgroup.json
[18:18:33] <logmsgbot>	 bking@cumin2002 reimage (PID 1179440) is awaiting input
[18:23:26] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: route query.wd.org to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1139531 (https://phabricator.wikimedia.org/T388134)
[18:24:58] <wikibugs>	 (03PS4) 10Ryan Kemper: sre.wdqs.data-transfer: improve graph type checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1097552 (https://phabricator.wikimedia.org/T376150)
[18:27:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P75542 and previous config saved to /var/cache/conftool/dbconfig/20250428-182711-fceratto.json
[18:28:55] <wikibugs>	 (03CR) 10Ryan Kemper: "That's a good point, I think the backend request will be a bit more straightforward. I'll try that approach first." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138935 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper)
[18:29:09] <wikibugs>	 (03CR) 10Ladsgroup: "If Growth team is okay with it, I can deploy it." [puppet] - 10https://gerrit.wikimedia.org/r/1139515 (https://phabricator.wikimedia.org/T392834) (owner: 10Gergő Tisza)
[18:29:20] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1139531 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper)
[18:30:03] <logmsgbot>	 !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aphlict2001.codfw.wmnet with reason: Bookworm Reboot
[18:30:13] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org
[18:30:49] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM aphlict2001.codfw.wmnet
[18:31:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:32:06] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:34:02] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:34:02] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:34:10] <sukhe>	 ^ expected
[18:34:37] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aphlict2001.codfw.wmnet
[18:34:56] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] P:durum: only log ECH status for ECH-enabled clients [puppet] - 10https://gerrit.wikimedia.org/r/1139525 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[18:35:26] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] P:durum: only log ECH status for ECH-enabled clients [puppet] - 10https://gerrit.wikimedia.org/r/1139525 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[18:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:36:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:37:16] <sukhe>	 !log run agent on A:durum
[18:37:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:35] <wikibugs>	 (03PS1) 10AOkoth: aphlict: ensure on passive host [puppet] - 10https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128)
[18:40:02] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:40:02] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:42:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T392806)', diff saved to https://phabricator.wikimedia.org/P75543 and previous config saved to /var/cache/conftool/dbconfig/20250428-184217-fceratto.json
[18:42:36] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[18:42:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T392806)', diff saved to https://phabricator.wikimedia.org/P75544 and previous config saved to /var/cache/conftool/dbconfig/20250428-184243-fceratto.json
[18:43:24] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: T392804
[18:45:08] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out2001.wikimedia.org with reason: T392804
[18:45:28] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir1001.eqiad.wmnet
[18:46:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:47:50] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org
[18:48:50] <wikibugs>	 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844 (10RobH) 03NEW
[18:49:11] <logmsgbot>	 !log brett@cumin2002 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on 14 hosts with reason: upgrades
[18:49:11] <wikibugs>	 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10773978 (10RobH)
[18:49:57] <wikibugs>	 10ops-codfw, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be2004 - https://phabricator.wikimedia.org/T392845 (10RobH) 03NEW
[18:50:01] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir1001.eqiad.wmnet
[18:50:03] <logmsgbot>	 !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 14 hosts with reason: upgrades
[18:50:18] <wikibugs>	 10ops-codfw, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be2004 - https://phabricator.wikimedia.org/T392845#10773997 (10RobH)
[18:50:22] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir1002.eqiad.wmnet
[18:50:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T392806)', diff saved to https://phabricator.wikimedia.org/P75545 and previous config saved to /var/cache/conftool/dbconfig/20250428-185043-fceratto.json
[18:50:56] <wikibugs>	 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10773999 (10RobH) a:03MatthewVernon Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new se...
[18:51:09] <wikibugs>	 10ops-codfw, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be2004 - https://phabricator.wikimedia.org/T392845#10774003 (10RobH) a:03MatthewVernon Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new se...
[18:54:49] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir1002.eqiad.wmnet
[18:55:21] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir2001.codfw.wmnet
[18:57:05] <wikibugs>	 (03CR) 10Dzahn: "The message says this enables it on the passive host. But it's disabling it on the active host." [puppet] - 10https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth)
[18:57:33] <wikibugs>	 (03PS1) 10Ryan Kemper: query-legacy-full: set cluster in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1139537 (https://phabricator.wikimedia.org/T384422)
[18:58:31] <wikibugs>	 (03CR) 10Dzahn: "You can either just delete any setting here at hosts level.. it would still be present on both and should be no change at all in compiler." [puppet] - 10https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth)
[18:58:32] <wikibugs>	 06SRE: Cannot connect to SQL server on mwmaint1002 - https://phabricator.wikimedia.org/T392846#10774042 (10jrbs)
[18:59:16] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir2001.codfw.wmnet
[18:59:30] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1139537 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper)
[19:01:30] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir2002.codfw.wmnet
[19:02:50] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns5004.wikimedia.org
[19:02:55] <wikibugs>	 (03PS2) 10AOkoth: aphlict: ensure absent on active host [puppet] - 10https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128)
[19:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[19:05:06] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:05:10] <wikibugs>	 10SRE-tools, 10Spicerack, 06Traffic: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848 (10ssingh) 03NEW
[19:05:31] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 06Traffic: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10774076 (10ssingh) p:05Triage→03Low
[19:05:50] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P75546 and previous config saved to /var/cache/conftool/dbconfig/20250428-190550-fceratto.json
[19:05:57] <wikibugs>	 (03CR) 10Dzahn: "Yes, now it matches what it does. technically there is no need to add the aphlict2001.yaml at all.. since present is default. But if you a" [puppet] - 10https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth)
[19:06:54] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir2002.codfw.wmnet
[19:07:02] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:07:02] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:07:34] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir(3|4|5|6|7)001.*
[19:07:35] <wikibugs>	 (03CR) 10Dzahn: "nitpick: not needed to allow failover.. and we could also just leave the service running on both..DNS switch alone should do it. but it do" [puppet] - 10https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth)
[19:08:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[19:08:30] <wikibugs>	 (03PS3) 10Dzahn: aphlict: ensure absent on active host [puppet] - 10https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth)
[19:09:11] <wikibugs>	 (03PS4) 10Dzahn: aphlict: ensure absent on active host [puppet] - 10https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth)
[19:09:41] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "+1 but only AFTER DNS change" [puppet] - 10https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth)
[19:13:02] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:13:04] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:16:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: confd_prometheus_metrics.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:16:28] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir3003.*
[19:17:21] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns5004.wikimedia.org
[19:18:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[19:18:10] <wikibugs>	 (03CR) 10Bking: [C:03+1] "Matches plan outlined in ticket" [puppet] - 10https://gerrit.wikimedia.org/r/1139531 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper)
[19:18:19] <wikibugs>	 (03PS1) 10Ssingh: hiera: durum: set do_ech true for all durum hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139542 (https://phabricator.wikimedia.org/T205378)
[19:19:18] <wikibugs>	 (03PS2) 10Ssingh: hiera: durum: set do_ech true for all durum hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139542 (https://phabricator.wikimedia.org/T205378)
[19:20:43] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] hiera: durum: set do_ech true for all durum hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139542 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[19:20:45] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1139542 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[19:20:58] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P75547 and previous config saved to /var/cache/conftool/dbconfig/20250428-192057-fceratto.json
[19:21:25] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:22:45] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir3003.esams.wmnet
[19:23:09] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir[4-7]001.*
[19:24:37] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir[3-7]002.*
[19:24:51] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir3004.*
[19:27:22] <icinga-wm>	 PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 84280MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[19:28:11] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir.*
[19:28:52] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for 14 hosts
[19:29:03] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 14 hosts
[19:30:50] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] gitlab: use read-only object storage credentials on replicas [puppet] - 10https://gerrit.wikimedia.org/r/1139005 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[19:32:21] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org
[19:34:15] <logmsgbot>	 !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: upgrades
[19:35:51] <wikibugs>	 (03PS1) 10AOkoth: wmnet: change active aphlict host [dns] - 10https://gerrit.wikimedia.org/r/1139546 (https://phabricator.wikimedia.org/T392128)
[19:35:54] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:36:05] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T392806)', diff saved to https://phabricator.wikimedia.org/P75548 and previous config saved to /var/cache/conftool/dbconfig/20250428-193605-fceratto.json
[19:36:10] <icinga-wm>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:36:26] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[19:36:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T392806)', diff saved to https://phabricator.wikimedia.org/P75549 and previous config saved to /var/cache/conftool/dbconfig/20250428-193632-fceratto.json
[19:38:10] <icinga-wm>	 PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:39:13] <wikibugs>	 (03PS1) 10BCornwall: acmechief: Switch active/passive instances [puppet] - 10https://gerrit.wikimedia.org/r/1139549
[19:40:10] <icinga-wm>	 RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:40:54] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:41:10] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:41:31] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] acmechief: Switch active/passive instances [puppet] - 10https://gerrit.wikimedia.org/r/1139549 (owner: 10BCornwall)
[19:42:37] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org
[19:43:35] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5378/console" [puppet] - 10https://gerrit.wikimedia.org/r/1139549 (owner: 10BCornwall)
[19:45:42] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] acmechief: Switch active/passive instances [puppet] - 10https://gerrit.wikimedia.org/r/1139549 (owner: 10BCornwall)
[19:47:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T392806)', diff saved to https://phabricator.wikimedia.org/P75550 and previous config saved to /var/cache/conftool/dbconfig/20250428-194708-fceratto.json
[19:48:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[19:52:10] <icinga-wm>	 PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:53:46] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:54:10] <icinga-wm>	 RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:55:22] <brett>	 !log Upgrade/reboot acme-chief servers
[19:55:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:37] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns6002.wikimedia.org
[19:58:49] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp40[53-68] - https://phabricator.wikimedia.org/T392851 (10RobH) 03NEW
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T2000).
[20:00:05] <jouncebot>	 danisztls, bwang, and bd808: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:31] <bd808>	 o/
[20:00:45] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp40[53-68] - https://phabricator.wikimedia.org/T392851#10774204 (10RobH) a:03ssingh @ssingh,  We didn't get racking details on ordering task T389840, so can you populate the racking details on this racking task?  Additionally, please update the site....
[20:01:11] <icinga-wm>	 PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:01:22] <danisztls>	 o/
[20:01:30] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp40[53-68] - https://phabricator.wikimedia.org/T392851#10774214 (10RobH)
[20:02:14] <bwang>	 o/
[20:02:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P75551 and previous config saved to /var/cache/conftool/dbconfig/20250428-200215-fceratto.json
[20:03:42] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:03:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[20:04:03] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:04:59] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:05:20] <bd808>	 danisztls and bwang: I can do the needful since it looks like the other deployers aren't here at the moment.
[20:06:10] <bd808>	 It doesn't look like any of our changes are easily testable on the staging servers.
[20:07:11] <icinga-wm>	 RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:07:13] <danisztls>	 bd808: yes, thanks
[20:10:47] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
[20:10:51] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
[20:10:55] <bd808>	 woah. what's this massive pile of "No space left on device" errors?
[20:11:04] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2100
[20:11:04] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2100
[20:11:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2045.codfw.wmnet with OS bookworm
[20:11:17] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774239 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm
[20:11:25] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: confd_prometheus_metrics.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:11:51] <dancy>	 bd808: mwmaint1002  ?
[20:11:54] <bd808>	 zabe: are you around? It looks like your job that is running migrateESRefToContentTable.php is having a really bad time.
[20:12:10] <bd808>	 dancy: yeah
[20:12:14] <dancy>	 That's T392834
[20:12:15] <stashbot>	 T392834: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834
[20:13:09] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns6002.wikimedia.org
[20:13:16] <bd808>	 454,813 events for it in logspam-watch
[20:13:23] <dancy>	 oof
[20:13:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[20:15:04] <wikibugs>	 06SRE, 13Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774253 (10bd808) There are 454,813 "PHP Notice: fwrite(): write of 63 bytes failed with errno=28 No space left on device" errors in `logspam-watch` right now. It looks like the `extensions/WikimediaMainten...
[20:16:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: confd_prometheus_metrics.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:16:28] <bd808>	 ok. the logspam looks unrelated to prod wikis, so let's get on with backports
[20:16:55] <mutante>	 do not worry about home dirs as long as we have this:
[20:16:56] <mutante>	 24G	mediawiki_job_growthexperiments-refreshLinkRecommendations-s2
[20:16:56] <mutante>	 25G	mediawiki_job_growthexperiments-refreshLinkRecommendations-s3
[20:17:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P75552 and previous config saved to /var/cache/conftool/dbconfig/20250428-201723-fceratto.json
[20:18:01] * bd808 is about to click SpiderPig's "Start Backport" button for his first time outside of local dev testing
[20:18:09] <dancy>	 Woohoo!
[20:18:14] <dancy>	 🕸️
[20:18:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) (owner: 10BryanDavis)
[20:18:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139474 (https://phabricator.wikimedia.org/T392325) (owner: 10DDesouza)
[20:18:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138859 (https://phabricator.wikimedia.org/T388719) (owner: 10Bernard Wang)
[20:18:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[20:18:59] <dancy>	 oooh going for a triple
[20:19:02] <wikibugs>	 06SRE, 13Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774270 (10Dzahn) It's almost entirely just logs from the growth experiments jobs.  and under /var/log/  ` 24G mediawiki_job_growthexperiments-refreshLinkRecommendations-s2 25G mediawiki_job_growthexperimen...
[20:19:19] <bd808>	 yeah, they are all config only and none of them are really testable
[20:20:01] <wikibugs>	 (Merged) jenkins-bot: dblists: Add sul.dbexpr and generated sul.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) (owner: BryanDavis)
[20:20:04] <wikibugs>	 (Merged) jenkins-bot: Design Research Participant Survey: Increase Coverage [mediawiki-config] - https://gerrit.wikimedia.org/r/1139474 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza)
[20:20:07] <wikibugs>	 (Merged) jenkins-bot: Remove Search AB test config [mediawiki-config] - https://gerrit.wikimedia.org/r/1138859 (https://phabricator.wikimedia.org/T388719) (owner: Bernard Wang)
[20:20:22] <logmsgbot>	 !log bd808@deploy1003 Started scap sync-world: Backport for [[gerrit:1137087|dblists: Add sul.dbexpr and generated sul.dblist (T392142)]], [[gerrit:1139474|Design Research Participant Survey: Increase Coverage (T392325)]], [[gerrit:1138859|Remove Search AB test config (T388719)]]
[20:20:29] <stashbot>	 T392142: Office Wiki credentials inexplicably stop working - https://phabricator.wikimedia.org/T392142
[20:20:29] <stashbot>	 T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325
[20:20:29] <stashbot>	 T388719: Clean up Search AB test code - https://phabricator.wikimedia.org/T388719
[20:21:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: confd_prometheus_metrics.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:23:35] <wikibugs>	 (CR) BCornwall: [C:+1] wmnet: change active aphlict host [dns] - https://gerrit.wikimedia.org/r/1139546 (https://phabricator.wikimedia.org/T392128) (owner: AOkoth)
[20:24:15] <logmsgbot>	 jhancock@cumin2002 reimage (PID 1302317) is awaiting input
[20:24:36] <wikibugs>	 (CR) Dzahn: [C:+1] wmnet: change active aphlict host [dns] - https://gerrit.wikimedia.org/r/1139546 (https://phabricator.wikimedia.org/T392128) (owner: AOkoth)
[20:25:02] <logmsgbot>	 !log bd808@deploy1003 dani, bwang, bd808: Backport for [[gerrit:1137087|dblists: Add sul.dbexpr and generated sul.dblist (T392142)]], [[gerrit:1139474|Design Research Participant Survey: Increase Coverage (T392325)]], [[gerrit:1138859|Remove Search AB test config (T388719)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:26:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: confd_prometheus_metrics.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:27:43] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2100.codfw.wmnet with reason: host reimage
[20:28:09] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
[20:28:20] <logmsgbot>	 !log bd808@deploy1003 dani, bwang, bd808: Continuing with sync
[20:30:05] <mutante>	 !log mwmaint1002 - manually gzipped some syslog.1 file from growthexperiment jobs that used up all disk space - systemctl start logrotate T392834
[20:30:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:11] <stashbot>	 T392834: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834
[20:31:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: confd_prometheus_metrics.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:31:59] <icinga-wm>	 PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:32:21] <icinga-wm>	 PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:32:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T392806)', diff saved to https://phabricator.wikimedia.org/P75553 and previous config saved to /var/cache/conftool/dbconfig/20250428-203230-fceratto.json
[20:32:48] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance
[20:32:48] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2100.codfw.wmnet with reason: host reimage
[20:32:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T392806)', diff saved to https://phabricator.wikimedia.org/P75554 and previous config saved to /var/cache/conftool/dbconfig/20250428-203255-fceratto.json
[20:34:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cirrussearch2093 is OK: OK - elasticsearch status production-search-omega-codfw: cluster_name: production-search-omega-codfw, status: green, timed_out: False, number_of_nodes: 29, number_of_data_nodes: 29, discovered_master: True, active_primary_shards: 1704, active_shards: 5111, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number
[20:34:22] <icinga-wm>	 ing_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:34:39] <jinxer-wm>	 RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2093-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[20:34:57] <logmsgbot>	 !log bd808@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137087|dblists: Add sul.dbexpr and generated sul.dblist (T392142)]], [[gerrit:1139474|Design Research Participant Survey: Increase Coverage (T392325)]], [[gerrit:1138859|Remove Search AB test config (T388719)]] (duration: 14m 34s)
[20:35:03] <stashbot>	 T392142: Office Wiki credentials inexplicably stop working - https://phabricator.wikimedia.org/T392142
[20:35:04] <stashbot>	 T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325
[20:35:04] <stashbot>	 T388719: Clean up Search AB test code - https://phabricator.wikimedia.org/T388719
[20:35:37] <bd808>	 danisztls and bwang: Your changes are live on the project wikis
[20:36:07] <wikibugs>	 SRE, Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774417 (Tgr) The mediawiki_job_growthexperiments-refreshLinkRecommendations-* logs should be fine to delete, if you are looking for some emergency space savings. It's the output of a job creating seconda...
[20:36:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: export_smart_data_dump.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:37:20] <icinga-wm>	 RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:37:55] <wikibugs>	 SRE, Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774427 (Tgr) Logrotate should probably enforce some default storage quota for jobs.
[20:38:00] <icinga-wm>	 RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:38:19] <bd808>	 everything looks normal on the error log watching places other than the T392834 stuff that is unrelated to the backports
[20:38:19] <bwang>	 Ok thank you
[20:38:19] <stashbot>	 T392834: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834
[20:38:41] * bd808 declares the backport window closed
[20:40:45] <bd808>	 dancy: a thought for SpiderPig -- what is the `scap backport --revert` story there? I think the answer is use ssh and scap on the cli, but maybe I'm missing something?
[20:41:19] <dancy>	 bd808: Make it easy to revert is on the list of improvements.
[20:41:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: export_smart_data_dump.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:41:27] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on cirrussearch2093:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:42:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T392806)', diff saved to https://phabricator.wikimedia.org/P75555 and previous config saved to /var/cache/conftool/dbconfig/20250428-204219-fceratto.json
[20:42:51] <wikibugs>	 SRE, Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774453 (Dzahn) Thanks for confirming that. I deleted the 2 largest syslog files, from mediawiki_job_growthexperiments-refreshLinkRecommendations-s2 and mediawiki_job_growthexperiments-refreshLinkRecommen...
[20:43:44] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
[20:45:12] <wikibugs>	 SRE, Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774460 (Dzahn) Stopping the services `mediawiki_job_growthexperiments-refreshLinkRecommendations-s2` and `mediawiki_job_growthexperiments-refreshLinkRecommendations-s3` also does not properly shut them d...
[20:45:37] <mutante>	 tgr_: looks like we'd have to manually kill processes to stop that
[20:45:54] <mutante>	   /bin/sh -c /usr/local/bin/foreachwikiindblist 'growthexperiments & s2'   ....   does not go away 
[20:46:25] <jinxer-wm>	 RESOLVED: [7x] SystemdUnitFailed: export_smart_data_dump.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:53:27] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2046.codfw.wmnet with OS bookworm
[20:53:33] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm
[20:53:43] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2047.codfw.wmnet with OS bookworm
[20:53:44] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2048.codfw.wmnet with OS bookworm
[20:53:46] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774507 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2046.codfw.wmnet with OS bookworm
[20:53:48] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774520 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm
[20:53:49] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774521 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm executed with err...
[20:53:52] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774523 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm
[20:54:12] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm
[20:54:22] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774525 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm
[20:56:54] <mutante>	 ok, mwmaint1002 disk issue resolved for now
[20:57:14] <mutante>	 had to also restart rsyslogd which kept deleted huge logs open and stuff
[20:57:23] <mutante>	 usage on / back to 60%
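[editor's note] The rsyslogd restart mentioned above reflects a general Linux behavior: a file that has been deleted keeps occupying disk space for as long as some process still holds it open. A minimal sketch of how such files could be located follows, assuming a Linux-style /proc where the kernel appends " (deleted)" to the link target of an unlinked file. The function name find_deleted_open_files and its proc_root parameter are hypothetical and for illustration only; this is not the command that was run on mwmaint1002.

```python
import os

# Suffix the Linux kernel appends to /proc/<pid>/fd/<n> link targets
# once the underlying file has been unlinked.
DELETED_SUFFIX = " (deleted)"


def find_deleted_open_files(proc_root="/proc"):
    """Return sorted (pid, fd, path) tuples for open descriptors whose file was deleted.

    proc_root is a parameter only so the scan can be pointed at a test tree.
    """
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are processes
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited meanwhile, or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed between listdir and readlink
            if target.endswith(DELETED_SUFFIX):
                path = target[: -len(DELETED_SUFFIX)]
                results.append((int(entry), int(fd), path))
    return sorted(results)
```

Calling find_deleted_open_files() as root and printing the tuples would list candidate processes to restart; seeing other users' descriptors generally needs elevated privileges.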
[20:57:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P75556 and previous config saved to /var/cache/conftool/dbconfig/20250428-205727-fceratto.json
[20:57:59] <wikibugs>	 SRE, Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774533 (Dzahn) Killed the processes for growthexperiments-refreshLinkRecommendations-s2 and growthexperiments-refreshLinkRecommendations-s3.  gzipped more syslog files.  Still not a lot of space.   rsysl...
[20:58:31] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2100.codfw.wmnet with OS bullseye
[20:58:43] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: gettimeofday() says it's time for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T2100)
[21:02:48] <icinga-wm>	 PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:03:00] <icinga-wm>	 PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:03:02] <wikibugs>	 SRE, Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774543 (Dzahn) ` kill 24015 kill 24047 ` ` systemctl start logrotate .. systemctl start prometheus-dpkg-success-textfile.service .. start prometheus_intel_microcode.service .. systemctl start prometheus-...
[21:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[21:04:46] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2046.codfw.wmnet with reason: host reimage
[21:04:55] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2047.codfw.wmnet with reason: host reimage
[21:05:52] <icinga-wm>	 RECOVERY - Disk space on mwmaint1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops
[21:08:00] <icinga-wm>	 RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:08:11] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2046.codfw.wmnet with reason: host reimage
[21:12:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P75557 and previous config saved to /var/cache/conftool/dbconfig/20250428-211234-fceratto.json
[21:14:21] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org
[21:14:21] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and not (A:eqiad or A:codfw) and A:dnsbox
[21:15:15] <wikibugs>	 SRE, serviceops-radar, Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774578 (Dzahn)
[21:15:50] <wikibugs>	 SRE: Cannot connect to SQL server on mwmaint1002 - https://phabricator.wikimedia.org/T392846#10774590 (Dzahn) almost certainly caused by T392834
[21:16:30] <wikibugs>	 SRE: Cannot connect to SQL server on mwmaint1002 - https://phabricator.wikimedia.org/T392846#10774594 (Dzahn) after mwmaint1002 has some disk space again. now:  ` [mwmaint1002:~] $ sql centralauth ... Welcome to the MariaDB monitor.  Commands end with ; or \g.  `
[21:17:07] <wikibugs>	 SRE, serviceops-radar: Cannot connect to SQL server on mwmaint1002 - https://phabricator.wikimedia.org/T392846#10774595 (Dzahn)
[21:20:20] <wikibugs>	 ops-codfw, SRE, DC-Ops, Traffic: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10774597 (ssingh)
[21:20:33] <wikibugs>	 ops-codfw, SRE, DC-Ops, Traffic: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10774599 (ssingh) a:ssingh→BCornwall
[21:23:25] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:23:54] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns2006*} and A:dnsbox
[21:23:54] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org
[21:24:32] <icinga-wm>	 RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:25:01] <wikibugs>	 ops-codfw, SRE, DC-Ops, Traffic: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10774617 (ssingh) Thanks @RobH. Task assigned to Traffic and hostnames updated. We will take care of the preseed.yaml bit, thanks for the reminder!
[21:26:14] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:26:15] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2046.codfw.wmnet with OS bookworm
[21:26:22] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774621 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2046.codfw.wmnet with OS bookworm completed: - gane...
[21:27:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T392806)', diff saved to https://phabricator.wikimedia.org/P75558 and previous config saved to /var/cache/conftool/dbconfig/20250428-212741-fceratto.json
[21:28:00] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[21:28:00] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:28:00] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:28:07] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T392806)', diff saved to https://phabricator.wikimedia.org/P75559 and previous config saved to /var/cache/conftool/dbconfig/20250428-212806-fceratto.json
[21:28:40] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774626 (Jhancock.wm)
[21:28:40] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2045.codfw.wmnet with OS bookworm
[21:28:46] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774627 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm executed with err...
[21:31:00] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:31:00] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:31:41] <sbassett>	 Hey all - we have two security patches going out for the window today.
[21:34:57] <wikibugs>	 ops-codfw, SRE, DC-Ops, Traffic: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10774644 (ssingh) (Scratch that, preseed.yaml is `cp[1-9][0-9][0-9][0-9]` so that's good but we just need to update site.pp)
[21:35:28] <jinxer-wm>	 FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[21:36:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T392806)', diff saved to https://phabricator.wikimedia.org/P75560 and previous config saved to /var/cache/conftool/dbconfig/20250428-213601-fceratto.json
[21:36:29] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org
[21:36:29] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns2006*} and A:dnsbox
[21:36:59] <wikibugs>	 ops-codfw, SRE, DC-Ops, Traffic: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10774654 (BCornwall)
[21:39:56] <wikibugs>	 (PS1) BCornwall: site.pp: Include new codfw cp hosts [puppet] - https://gerrit.wikimedia.org/r/1139559 (https://phabricator.wikimedia.org/T392851)
[21:42:05] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[21:42:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[21:46:56] <sbassett>	 !log Deployed security fix for T385792
[21:46:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:36] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[21:47:49] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[21:48:20] <wikibugs>	 (PS6) Dzahn: gerrit: have different motd banners on active/passive servers [puppet] - https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212)
[21:48:20] <wikibugs>	 (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1137840/5383/" [puppet] - https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: Dzahn)
[21:48:42] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:48:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[21:51:08] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P75561 and previous config saved to /var/cache/conftool/dbconfig/20250428-215107-fceratto.json
[21:51:30] <wikibugs>	 (PS7) Dzahn: gerrit: have different motd banners on active/passive servers [puppet] - https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212)
[21:56:03] <logmsgbot>	 jhancock@cumin2002 reimage (PID 1409724) is awaiting input
[22:03:19] <sbassett>	 !log Deployed security fix for T392276
[22:03:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:15] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P75562 and previous config saved to /var/cache/conftool/dbconfig/20250428-220615-fceratto.json
[22:09:43] <wikibugs>	 SRE, serviceops-radar: Cannot connect to MariaDB server from mwmaint1002 - https://phabricator.wikimedia.org/T392846#10774830 (Reedy)
[22:13:42] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:19:40] <sbassett>	 !log Deployed security fix for T391343
[22:19:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T392806)', diff saved to https://phabricator.wikimedia.org/P75563 and previous config saved to /var/cache/conftool/dbconfig/20250428-222122-fceratto.json
[22:21:41] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance
[22:21:48] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T392806)', diff saved to https://phabricator.wikimedia.org/P75564 and previous config saved to /var/cache/conftool/dbconfig/20250428-222148-fceratto.json
[22:23:37] <wikibugs>	 (PS6) BryanDavis: Use `sul` dblist in InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/1137480
[22:29:47] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T392806)', diff saved to https://phabricator.wikimedia.org/P75566 and previous config saved to /var/cache/conftool/dbconfig/20250428-222946-fceratto.json
[22:30:07] <wikibugs>	 (CR) Bking: [C:+2] Update opensearch-madvise call for version 0.2 [puppet] - https://gerrit.wikimedia.org/r/1132692 (https://phabricator.wikimedia.org/T390592) (owner: Ebernhardson)
[22:31:24] <wikibugs>	 (CR) Bking: [C:+2] "I built and deployed the deb mentioned in Ebernhardson's comment, so we are good to move forward." [puppet] - https://gerrit.wikimedia.org/r/1132692 (https://phabricator.wikimedia.org/T390592) (owner: Ebernhardson)
[22:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:38:06] <wikibugs>	 (CR) BryanDavis: [C:-1] "I need help thinking about https://phabricator.wikimedia.org/P75565 and how to handle the logic inversion I am doing in the Beta Cluster w" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137480 (owner: BryanDavis)
[22:38:47] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:40:09] <wikibugs>	 (PS1) Bking: Revert "Update opensearch-madvise call for version 0.2" [puppet] - https://gerrit.wikimedia.org/r/1139565
[22:40:26] <wikibugs>	 (CR) Bking: [V:+2 C:+2] Revert "Update opensearch-madvise call for version 0.2" [puppet] - https://gerrit.wikimedia.org/r/1139565 (owner: Bking)
[22:44:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P75567 and previous config saved to /var/cache/conftool/dbconfig/20250428-224453-fceratto.json
[22:48:42] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:53:42] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:54:05] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10774902 (VRiley-WMF) Open→Resolved Dell was onsite today and replaced the motherboard, moved DIMMs around, replaced cables and replaced a CPU. Here's hoping we can finally close this ticket,...
[23:00:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P75568 and previous config saved to /var/cache/conftool/dbconfig/20250428-230001-fceratto.json
[23:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250428T2300)
[23:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[23:04:50] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm
[23:05:02] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774920 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm
[23:11:24] <wikibugs>	 (CR) BryanDavis: [C:-1] "Paying more attention, there are currently only 2 Beta Cluster wikis that end up with unexpected config:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137480 (owner: BryanDavis)
[23:13:42] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:15:08] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T392806)', diff saved to https://phabricator.wikimedia.org/P75569 and previous config saved to /var/cache/conftool/dbconfig/20250428-231508-fceratto.json
[23:15:27] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
[23:15:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T392806)', diff saved to https://phabricator.wikimedia.org/P75570 and previous config saved to /var/cache/conftool/dbconfig/20250428-231534-fceratto.json
[23:30:23] <wikibugs>	 SRE, serviceops-radar, Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774930 (bd808) p:Unbreak!→High Dropping priority to High as it seems @Dzahn's cleanup work has taken care of the immediate problem. I'll leave it to him and others to decide...
[23:35:05] <wikibugs>	 SRE, serviceops-radar, Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774935 (Zabe) >>! In T392834#10773349, @elukey wrote: > ` > elukey@mwmaint1002:/home$ sudo du -hs /home/* | sort -h | tail > 553M /home/ebernhardson > 842M /home/catrope > 1.2G /hom...
[23:39:45] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1139571
[23:39:45] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1139571 (owner: TrainBranchBot)
[23:45:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T392806)', diff saved to https://phabricator.wikimedia.org/P75571 and previous config saved to /var/cache/conftool/dbconfig/20250428-234542-fceratto.json
[23:47:43] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2047.codfw.wmnet with OS bookworm
[23:47:54] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10774945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm executed with err...
[23:48:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[23:50:42] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1139571 (owner: TrainBranchBot)
[23:53:42] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:54:07] <zabe>	 !log zabe@mwmaint1002:~$ mwscript extensions/AbuseFilter/maintenance/MigrateESRefToAflTable.php enwiki --deletedump /home/zabe/afl_text_table_deletedump/enwiki --dump /home/zabe/afl_text_table_dump/enwiki --sleep 0.5 # T381599
[23:54:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:11] <stashbot>	 T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599