[00:00:00] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:04:25] RESOLVED: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:05:15] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:07:27] PROBLEM - OpenSearch health check for shards on 9600 on cirrussearch2080 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [00:07:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch2080 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [00:08:05] PROBLEM - Check unit status of backup-kdc-database on krb1002 is CRITICAL: CRITICAL: Status of the systemd unit backup-kdc-database https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:08:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2076-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [00:09:25] FIRING: [4x] SystemdUnitFailed: backup-kdc-database.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:09:27] FIRING: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2076:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [00:10:25] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1138943 [00:10:26] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1138943 (owner: TrainBranchBot) [00:11:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:12:11] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 
v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:12:13] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:12:31] (CR) BryanDavis: [C:+1] Add WMCS ranges to wgAutoblockExemptions [mediawiki-config] - https://gerrit.wikimedia.org/r/1137732 (https://phabricator.wikimedia.org/T386689) (owner: Majavah) [00:13:11] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:14:07] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:18:36] ops-codfw, SRE, DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10767162 (Papaul) [00:19:31] ops-codfw, SRE, DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10767164 (Papaul) [00:20:00] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:20:17] ops-codfw, SRE, DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10767165 (Papaul) [00:26:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:31:46] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 
https://gerrit.wikimedia.org/r/1138943 (owner: TrainBranchBot) [00:40:00] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:40:55] (CR) Ssingh: "Looks good, a question around min_grace_sleep." [cookbooks] - https://gerrit.wikimedia.org/r/1129882 (owner: BCornwall) [00:43:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:45:00] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:50:00] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:54:13] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:55:00] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:55:13] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:56:13] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:57:09] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:00:00] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [01:08:46] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [01:15:00] RESOLVED: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [01:40:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [02:00:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [02:03:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:04:13] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:05:13] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:05:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [02:19:34] (PS1) Andrew Bogott: Rename profile::openstack::::nova::instance_network_id [puppet] - https://gerrit.wikimedia.org/r/1138946 [02:19:34] (PS1) Andrew Bogott: nova-fullstack: switch to the dual-stack network for test VMs. 
[puppet] - https://gerrit.wikimedia.org/r/1138947 [02:21:46] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1138946 (owner: Andrew Bogott) [02:21:48] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:21:52] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1138947 (owner: Andrew Bogott) [02:25:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [02:28:41] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:30:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [02:31:32] (PS2) Andrew Bogott: Rename profile::openstack::::nova::instance_network_id [puppet] - https://gerrit.wikimedia.org/r/1138946 [02:31:32] (PS2) Andrew Bogott: nova-fullstack: switch to the dual-stack network for test VMs. 
[puppet] - https://gerrit.wikimedia.org/r/1138947 [02:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:54:21] (PS1) Andrew Bogott: cloudlb haproxy.cfg: replace 'forceclose' with 'httpclose' [puppet] - https://gerrit.wikimedia.org/r/1138949 [02:55:37] RECOVERY - haproxy process on cloudlb2004-dev is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [03:00:51] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:00:57] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:01:09] RECOVERY - haproxy alive on cloudlb2004-dev is OK: OK check_alive uptime 355s https://wikitech.wikimedia.org/wiki/HAProxy [03:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:05:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [03:07:23] (PS2) Andrew Bogott: cloudlb haproxy.cfg: replace 'forceclose' with 'httpclose' [puppet] - https://gerrit.wikimedia.org/r/1138949 [03:07:52] (CR) CI reject: [V:-1] cloudlb haproxy.cfg: replace 'forceclose' with 'httpclose' [puppet] - 
https://gerrit.wikimedia.org/r/1138949 (owner: Andrew Bogott) [03:08:11] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:08:11] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:09:09] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:09:09] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:15:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [03:21:39] (PS3) Andrew Bogott: cloudlb haproxy.cfg: replace 'forceclose' with 'httpclose' [puppet] - https://gerrit.wikimedia.org/r/1138949 (https://phabricator.wikimedia.org/T377126) [03:22:58] (PS1) RLazarus: deployment_server: Use ~/.cache/helm if /var/cache/helm isn't writable [puppet] - https://gerrit.wikimedia.org/r/1138951 (https://phabricator.wikimedia.org/T378429) [03:25:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [03:29:58] (PS2) RLazarus: deployment_server: Use ~/.cache/helm if /var/cache/helm isn't writable [puppet] - https://gerrit.wikimedia.org/r/1138951 (https://phabricator.wikimedia.org/T378429) [03:30:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [03:31:04] ops-codfw, SRE, DC-Ops, SRE Observability (FY2024/2025-Q4): kafka-logging2005 is down since six days - https://phabricator.wikimedia.org/T392488#10767248 (Papaul) @Jhancock.wm @herron the issue here is that the switch port is part of vlan private1-d4-codfw (10.192.39.0/24) or the IP address on th... [03:31:13] RECOVERY - Host kafka-logging2005 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms [03:31:17] PROBLEM - Kafka Broker Server on kafka-logging2005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [03:31:29] PROBLEM - Kafka broker TLS certificate validity on kafka-logging2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [03:32:17] RECOVERY - Kafka Broker Server on kafka-logging2005 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [03:32:29] RECOVERY - Kafka broker TLS certificate validity on kafka-logging2005 is OK: SSL OK - Certificate kafka-logging2005.codfw.wmnet valid until 2026-04-25 03:27:00 +0000 (expires in 364 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [03:33:29] ops-codfw, SRE, DC-Ops, SRE Observability (FY2024/2025-Q4): kafka-logging2005 is down since six days - https://phabricator.wikimedia.org/T392488#10767249 (Papaul) Open→Resolved a:Papaul [03:39:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on elastic1067:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:45:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [03:46:17] RECOVERY - Bird Internet Routing Daemon on cloudlb2004-dev is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [03:46:23] RECOVERY - Check if anycast-healthchecker and all configured threads are running on cloudlb2004-dev is OK: OK: UP (pid=2805980) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [03:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:49:14] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:53:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:53:46] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:53:59] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:08:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2076-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [04:09:25] FIRING: [4x] SystemdUnitFailed: backup-kdc-database.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:09:31] FIRING: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2076:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [04:19:35] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (krb1002), Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:24:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on elastic1067:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:05:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2030 T391921', diff saved to https://phabricator.wikimedia.org/P75452 and previous config saved to /var/cache/conftool/dbconfig/20250425-050538-marostegui.json [05:05:44] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921 [05:06:16] (PS1) Marostegui: es2030: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1138952 (https://phabricator.wikimedia.org/T391921) [05:06:40] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2030.codfw.wmnet with reason: Maintenance [05:07:27] (CR) Marostegui: [C:+2] es2030: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1138952 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui) [05:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:08:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [05:12:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75453 and previous config saved to /var/cache/conftool/dbconfig/20250425-051257-root.json [05:15:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [05:19:37] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 140 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:25:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [05:28:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75454 and previous config saved to /var/cache/conftool/dbconfig/20250425-052802-root.json [05:37:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1032 to es1 master T391921', diff saved to https://phabricator.wikimedia.org/P75455 and previous config saved to /var/cache/conftool/dbconfig/20250425-053744-marostegui.json [05:37:49] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921 [05:38:42] (PS1) Marostegui: wmnet: Promote es1032 to es1 master [dns] - https://gerrit.wikimedia.org/r/1138956 (https://phabricator.wikimedia.org/T391921) [05:39:32] (CR) Marostegui: [C:+2] wmnet: Promote es1032 to es1 master [dns] - https://gerrit.wikimedia.org/r/1138956 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui) [05:39:44] !log marostegui@dns1006 START - running authdns-update [05:42:16] !log marostegui@dns1006 END - running authdns-update [05:43:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75456 and previous config saved to /var/cache/conftool/dbconfig/20250425-054308-root.json [05:45:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [05:48:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:53:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:58:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75457 and previous config saved to /var/cache/conftool/dbconfig/20250425-055813-root.json [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250425T0600) [06:00:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [06:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:10:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [06:13:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75458 and previous config saved to /var/cache/conftool/dbconfig/20250425-061319-root.json [06:21:49] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:25:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:28:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75459 and previous config saved to /var/cache/conftool/dbconfig/20250425-062824-root.json [06:28:41] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:30:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:30:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [06:30:51] (CR) Muehlenhoff: [C:+2] Add trixie to the list of supported OSes [cookbooks] - https://gerrit.wikimedia.org/r/1138698 (https://phabricator.wikimedia.org/T391083) (owner: Muehlenhoff) [06:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:35:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [06:40:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:40:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [06:43:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75460 and previous config saved to /var/cache/conftool/dbconfig/20250425-064329-root.json [06:55:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - 
https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [06:58:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75461 and previous config saved to /var/cache/conftool/dbconfig/20250425-065834-root.json [06:59:13] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250425T0700) [07:00:09] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:01:26] ops-codfw, SRE, DC-Ops, SRE Observability (FY2024/2025-Q4): kafka-logging2005 is down since six days - https://phabricator.wikimedia.org/T392488#10767409 (MoritzMuehlenhoff) Resolved→Open Logins to kafka-logging2005 are still failing, I'm reopening the task. [07:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [07:10:13] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:10:15] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:11:09] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:11:09] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:13:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75462 and previous config saved to /var/cache/conftool/dbconfig/20250425-071339-root.json [07:15:38] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Unable to undelete File:Hawkmoth (Meganoton nyctiphanes) (8688240817).jpg - https://phabricator.wikimedia.org/T392658#10767420 (10A_smart_kitten) [07:32:32] (03CR) 10Majavah: [C:03+2] cloudlb haproxy.cfg: replace 'forceclose' with 'httpclose' [puppet] - 10https://gerrit.wikimedia.org/r/1138949 (https://phabricator.wikimedia.org/T377126) (owner: 10Andrew Bogott) [07:35:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [07:36:49] RESOLVED: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:38:25] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2002-dev.codfw.wmnet [07:38:38] (03CR) 10Marostegui: [C:03+1] Swift: drain ms-be2080 (prep for VLAN move) [puppet] - 10https://gerrit.wikimedia.org/r/1138830 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [07:44:59] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 
krb1002.eqiad.wmnet with reason: work in progress, not yet active [07:45:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [07:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [07:50:30] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2002-dev.codfw.wmnet [07:53:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:53:49] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:58:53] (03PS1) 10Majavah: cloudlb: Bind on IPv6 too when no address has been specified [puppet] - 10https://gerrit.wikimedia.org/r/1138960 (https://phabricator.wikimedia.org/T379282) [07:59:16] (03CR) 10CI reject: [V:04-1] cloudlb: Bind on IPv6 too when no address has been specified [puppet] - 10https://gerrit.wikimedia.org/r/1138960 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [08:00:12] (03PS2) 10Majavah: cloudlb: Bind on IPv6 too when no address has been specified [puppet] - 10https://gerrit.wikimedia.org/r/1138960 (https://phabricator.wikimedia.org/T379282) [08:00:34] (03CR) 
10CI reject: [V:04-1] cloudlb: Bind on IPv6 too when no address has been specified [puppet] - 10https://gerrit.wikimedia.org/r/1138960 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [08:01:26] (03PS3) 10Majavah: cloudlb: Bind on IPv6 too when no address has been specified [puppet] - 10https://gerrit.wikimedia.org/r/1138960 (https://phabricator.wikimedia.org/T379282) [08:04:11] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5367/co" [puppet] - 10https://gerrit.wikimedia.org/r/1138960 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [08:08:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2076-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [08:09:27] FIRING: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2076:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [08:24:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2032 to es1 master T391921', diff saved to https://phabricator.wikimedia.org/P75463 and previous config saved to /var/cache/conftool/dbconfig/20250425-082420-marostegui.json [08:24:25] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921 [08:27:42] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "the haproxy config syntax is suprising, but if it works, then LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1138960 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [08:29:01] (03CR) 10Majavah: [V:03+1 C:03+2] cloudlb: Bind on IPv6 too when no address has been specified [puppet] - 10https://gerrit.wikimedia.org/r/1138960 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [08:29:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:35:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:39:52] (03CR) 10Jelto: "Thanks for opening the change! I'm not very familiar with how the proxying from the gui frontend to the backend works, but the `custom-con" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138935 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [08:48:13] (03PS1) 10Jelto: gerrit: add ports to hackathon nftables rule [puppet] - 10https://gerrit.wikimedia.org/r/1138995 (https://phabricator.wikimedia.org/T382309) [08:49:32] (03CR) 10Jelto: [V:03+1 C:03+1] "I uploaded a small addition regarding the ports in I1a6870bf6c08a3803623b3f0c6432763f593bded" [puppet] - 10https://gerrit.wikimedia.org/r/1138468 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [08:50:27] (03CR) 10Majavah: [C:03+1] gerrit: add ports to hackathon nftables rule [puppet] - 10https://gerrit.wikimedia.org/r/1138995 (https://phabricator.wikimedia.org/T382309) (owner: 10Jelto) [08:50:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:53:37] (03CR) 10Hashar: "I have put the sudo rules next to the definitions of the systemd timers, to have everything at the same place. That sounded natural." [puppet] - 10https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) (owner: 10Hashar) [08:53:41] FIRING: [8x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:55:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:57:21] (03CR) 10Hashar: "I guess this one can go at anytime, it is not going to affect Apache/Gerrit etc :)" [puppet] - 10https://gerrit.wikimedia.org/r/1138330 (owner: 10Hashar) [08:58:04] (03CR) 10Jelto: [C:03+2] gerrit: add ports to hackathon nftables rule [puppet] - 10https://gerrit.wikimedia.org/r/1138995 (https://phabricator.wikimedia.org/T382309) (owner: 10Jelto) [09:00:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:01:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:05:34] (03CR) 10Jelto: [V:03+1 C:03+2] "nftables is 
happy with the new change, the rule is listed as" [puppet] - 10https://gerrit.wikimedia.org/r/1138995 (https://phabricator.wikimedia.org/T382309) (owner: 10Jelto) [09:05:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:08:46] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [09:10:42] (03PS1) 10Hashar: gerrit: add google-site-verification [dns] - 10https://gerrit.wikimedia.org/r/1138996 (https://phabricator.wikimedia.org/T392669) [09:10:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:11:50] !log removed cloudlb2001-dev bgp session from cloudsw1-b1-codfw T377126 [09:11:51] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:54] T377126: replace cloudlb2001-dev with cloudlb2004-dev - https://phabricator.wikimedia.org/T377126 [09:11:57] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:12:21] FIRING: SLOMetricAbsent: wdqs-availability codfw - https://slo.wikimedia.org/?search=wdqs-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent 
[09:13:06] 06SRE, 06cloud-services-team, 10Cloud-VPS: cloudlb2001-dev and cloudlb2002-dev connected at different speeds - https://phabricator.wikimedia.org/T348173#10767622 (10taavi) 05Open→03Invalid obsolete with {T377126} [09:13:14] (03PS2) 10Hashar: gerrit: convert robots.txt to a flat file [puppet] - 10https://gerrit.wikimedia.org/r/1138330 (https://phabricator.wikimedia.org/T392669) [09:13:16] (03PS2) 10Hashar: gerrit: prevent crawling of some URLs [puppet] - 10https://gerrit.wikimedia.org/r/1138331 (https://phabricator.wikimedia.org/T392669) [09:13:55] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138330 (https://phabricator.wikimedia.org/T392669) (owner: 10Hashar) [09:14:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:15:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:16:12] !log restarting puppetserver on puppetserver1003 [09:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:21] RESOLVED: [3x] SLOMetricAbsent: wdqs-availability codfw - https://slo.wikimedia.org/?search=wdqs-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [09:31:59] !log restarting puppetserver on puppetserver1002 (apparently needs a restart which per timing seems related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138904) [09:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:45:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:46:59] (03PS2) 10Muehlenhoff: Allow releng to resume train related systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) (owner: 10Hashar) [09:47:50] (03CR) 10Muehlenhoff: "It gets enabled via profile::admin::groups, I have updated the patch accordingly." [puppet] - 10https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) (owner: 10Hashar) [09:50:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:50:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:51:44] (03PS4) 10Samtar: InitialiseSettings: wgTemplateDataEnableDiscovery on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134771 (https://phabricator.wikimedia.org/T377975) [09:51:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:55:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:55:45] RESOLVED: [3x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:01:34] (03CR) 10Majavah: [C:03+2] openstack: designate: Remove nova_fixed_multi code [puppet] - 10https://gerrit.wikimedia.org/r/1138373 (https://phabricator.wikimedia.org/T378192) (owner: 10Majavah) [10:20:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134771 (https://phabricator.wikimedia.org/T377975) (owner: 10Samtar) [10:23:39] (03CR) 10MVernon: [C:03+1] restbase: configure restbase104[3-5] for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1138838 (https://phabricator.wikimedia.org/T389423) (owner: 10Eevans) [10:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:45:21] (03PS1) 10Hnowlan: mediawiki::periodic_job: allow for use of a migration title for long job names [puppet] - 10https://gerrit.wikimedia.org/r/1139004 
(https://phabricator.wikimedia.org/T341555) [10:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:59:57] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1138946 (owner: 10Andrew Bogott) [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250425T0700) [11:00:05] jelto, arnoldokoth, and mutante: GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250425T1100). Please do the needful. [11:00:15] (03PS3) 10Hnowlan: mediawiki: migrate unsubscribeinactiveusers-metawiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138810 (https://phabricator.wikimedia.org/T388539) [11:00:50] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1138947 (owner: 10Andrew Bogott) [11:04:54] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: cleanup pre-IPv6 settings [puppet] - 10https://gerrit.wikimedia.org/r/1138743 (https://phabricator.wikimedia.org/T380174) [11:05:12] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138743 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [11:05:19] (03PS4) 10Hnowlan: mediawiki: migrate unsubscribeinactiveusers-metawiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138810 (https://phabricator.wikimedia.org/T388539) [11:06:58] (03CR) 10Kamila Součková: [C:03+1] deployment_server: Use ~/.cache/helm if /var/cache/helm isn't writable [puppet] - 10https://gerrit.wikimedia.org/r/1138951 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [11:08:25] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138855 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [11:08:35] (03PS5) 10Hnowlan: mediawiki: migrate unsubscribeinactiveusers-metawiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138810 (https://phabricator.wikimedia.org/T388539) [11:09:46] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138810 (https://phabricator.wikimedia.org/T388539) (owner: 10Hnowlan) [11:13:43] (03PS1) 10Jelto: gitlab: use read-only object storage credentials on replicas [puppet] - 10https://gerrit.wikimedia.org/r/1139005 (https://phabricator.wikimedia.org/T378922) [11:14:32] (03CR) 10Kamila Součková: Rakefile: remove semver-cli requirement (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138839 (owner: 10Kamila Součková) [11:14:58] (03Abandoned) 10Kamila Součková: Rakefile: remove semver-cli requirement [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138839 (owner: 10Kamila Součková) [11:19:11] PROBLEM - OSPF status on cr3-ulsfo is 
CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:19:48] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10767820 (10Jelto) I've done some more tests with gitlab-replica-b.wikimedia.org and the download speed, latency and availability o... [11:20:04] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10767822 (10Jelto) [11:20:09] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:21:10] (03CR) 10Kamila Součková: [C:03+1] mediawiki::periodic_job: allow for use of a migration title for long job names [puppet] - 10https://gerrit.wikimedia.org/r/1139004 (https://phabricator.wikimedia.org/T341555) (owner: 10Hnowlan) [11:21:38] (03CR) 10Kamila Součková: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1139004 (https://phabricator.wikimedia.org/T341555) (owner: 10Hnowlan) [11:25:26] (03PS1) 10Jelto: gitlab: disable ci_secure_files object storage [puppet] - 10https://gerrit.wikimedia.org/r/1139007 (https://phabricator.wikimedia.org/T378922) [11:25:49] (03Abandoned) 10Kamila Součková: CampaignEvents: Shorten aggregateparticipantanswers name [puppet] - 10https://gerrit.wikimedia.org/r/1138855 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [11:26:20] (03PS2) 10Hnowlan: mediawiki::periodic_job: allow for use of a migration title for long job names [puppet] - 10https://gerrit.wikimedia.org/r/1139004 (https://phabricator.wikimedia.org/T341555) [11:29:22] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1139004 (https://phabricator.wikimedia.org/T341555) (owner: 10Hnowlan) [11:31:18] !log 
jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS trixie [11:31:23] 06SRE, 06Infrastructure-Foundations: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10767837 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1001.eqiad.wmnet with OS trixie [11:36:39] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10767841 (10MoritzMuehlenhoff) [11:43:49] (03CR) 10Kamila Součková: [C:03+1] mediawiki: migrate unsubscribeinactiveusers-metawiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138810 (https://phabricator.wikimedia.org/T388539) (owner: 10Hnowlan) [11:49:08] (03CR) 10Hnowlan: [C:03+2] mediawiki::periodic_job: allow for use of a migration title for long job names [puppet] - 10https://gerrit.wikimedia.org/r/1139004 (https://phabricator.wikimedia.org/T341555) (owner: 10Hnowlan) [11:49:28] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [11:49:46] (03CR) 10Kamila Součková: "@jwodstrcil@wikimedia.org I'm not familiar enough with debian builds, is this CI build error an actual problem?" 
[debs/helm3] - 10https://gerrit.wikimedia.org/r/1137010 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [11:50:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:51:50] (03CR) 10Kamila Součková: [C:03+1] services: enable ingress for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [11:53:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [11:53:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:53:49] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:54:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10767871 (10phaultfinder) [11:55:21] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw: cleanup pre-IPv6 settings [puppet] - 10https://gerrit.wikimedia.org/r/1138743 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [11:56:13] jmm@cumin2002 reimage (PID 583340) is awaiting input [12:07:17] (03PS6) 10Hnowlan: mediawiki: migrate unsubscribeinactiveusers-metawiki to k8s [puppet] - 
10https://gerrit.wikimedia.org/r/1138810 (https://phabricator.wikimedia.org/T388539) [12:07:17] (03PS2) 10Hnowlan: mediawiki: migrate all unsubscribeinactiveusers jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) [12:08:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2076-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:09:08] (03PS1) 10Majavah: hieradata: Expand GitLab blocklist for new WMCS IP space [puppet] - 10https://gerrit.wikimedia.org/r/1139016 [12:09:27] FIRING: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2076:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:10:04] (03CR) 10CI reject: [V:04-1] mediawiki: migrate all unsubscribeinactiveusers jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) (owner: 10Hnowlan) [12:16:22] (03PS3) 10Hnowlan: mediawiki: migrate all unsubscribeinactiveusers jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) [12:20:05] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138827 (https://phabricator.wikimedia.org/T391348) (owner: 10Brouberol) [12:20:27] (03CR) 10Jelto: "I think this is a bookworm/bullseye issue. The build works fine on bookworm (I just verified it on `build2002`). 
I guess the CI is still o" [debs/helm3] - 10https://gerrit.wikimedia.org/r/1137010 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [12:22:01] (03PS1) 10Majavah: P:toolforge: disable_tool: Use ToolsDB internal IP instead [puppet] - 10https://gerrit.wikimedia.org/r/1139018 (https://phabricator.wikimedia.org/T381272) [12:22:03] (03PS1) 10Majavah: P:wmcs: maintain_dbusers: Use cloud-private for ToolsDB [puppet] - 10https://gerrit.wikimedia.org/r/1139019 (https://phabricator.wikimedia.org/T381272) [12:25:06] (03PS1) 10Hnowlan: mediawiki::maintenance: migrate main startupregistrystats job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139020 (https://phabricator.wikimedia.org/T388540) [12:27:13] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:27:13] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:27:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:29:09] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:29:09] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:43:41] PROBLEM - Disk space on analytics1073 is CRITICAL: DISK CRITICAL - free space: / 1995 MB (3% inode=95%): /tmp 1995 MB (3% inode=95%): /var/tmp 1995 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1073&var-datasource=eqiad+prometheus/ops [12:46:33] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:46:35] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:46:37] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:46:42] Emperor: ^^ [12:46:47] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:46:47] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:46:48] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:47:15] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:47:23] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [12:47:25] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [12:47:27] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [12:47:35] grafana was me I think [12:47:37] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [12:47:37] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift [12:47:43] PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:47:43] PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:47:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:47:51] 
PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:47:59] grafana is currently down from here [12:48:13] I think that's on me, I input envoy_cluster_upstream_rq_time_bucket in the Explore tab [12:48:45] yeah, I can't eyeball swift graphs ATM [12:49:13] probably just needs a restart [12:49:25] FIRING: [4x] SystemdUnitFailed: backup-kdc-database.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:51:33] RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Thu 22 May 2025 06:12:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:51:33] RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Thu 22 May 2025 06:12:00 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:51:43] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 0.493 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:53:41] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:54:43] PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:54:43] PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:55:11] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 6.275 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:55:51] PROBLEM - SSH on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:55:51] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:57:33] RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Thu 22 May 2025 06:12:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:57:33] RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Thu 22 May 2025 06:12:00 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:57:51] RECOVERY - SSH on grafana1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:58:15] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:58:51] !log restarting grafana-server.service @ grafana1002.eqiad.wmnet [12:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:41] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:00:05] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:02:45] PROBLEM - Host kafka-logging2005 is DOWN: PING CRITICAL - Packet loss = 100% [13:03:35] RECOVERY - Host kafka-logging2005 is UP: PING OK - Packet loss = 0%, RTA = 30.29 ms [13:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:06:34] (03CR) 10Brouberol: [C:03+2] airflow-analytics-test: split the airflow and postgresql deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138827 (https://phabricator.wikimedia.org/T391348) (owner: 10Brouberol) [13:08:20] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1001.eqiad.wmnet with OS trixie [13:08:31] 06SRE, 
06Infrastructure-Foundations: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10768085 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1001.eqiad.wmnet with OS trixie executed with errors: - srete... [13:08:46] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [13:21:19] (03CR) 10Andrew Bogott: [C:03+2] Rename profile::openstack::::nova::instance_network_id [puppet] - 10https://gerrit.wikimedia.org/r/1138946 (owner: 10Andrew Bogott) [13:21:25] (03CR) 10Andrew Bogott: [C:03+2] nova-fullstack: switch to the dual-stack network for test VMs. [puppet] - 10https://gerrit.wikimedia.org/r/1138947 (owner: 10Andrew Bogott) [13:22:11] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 722088024 and 56 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:26:11] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 55680 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:26:27] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis nupwiki in section s5 [13:28:27] (03PS1) 10Cathal Mooney: Add include statement for WMCS service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1139033 (https://phabricator.wikimedia.org/T379282) [13:29:08] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Checking sanitization for wikis nupwiki in section s5 [13:29:09] (03CR) 10CI reject: [V:04-1] Add include statement for WMCS service VIP reverse 
IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1139033 (https://phabricator.wikimedia.org/T379282) (owner: 10Cathal Mooney) [13:31:17] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis nupwiki in section s5 [13:32:39] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2081 to cirrussearch2081 [13:32:53] (03CR) 10Ssingh: [C:03+1] "Looks good once the file exists!" [dns] - 10https://gerrit.wikimedia.org/r/1139033 (https://phabricator.wikimedia.org/T379282) (owner: 10Cathal Mooney) [13:33:03] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:33:19] !log add cloudlb2004-dev bgp session to cloudsw1-b1-codfw T377126 [13:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:24] T377126: replace cloudlb2001-dev with cloudlb2004-dev - https://phabricator.wikimedia.org/T377126 [13:34:41] !log taavi@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet [13:34:53] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:36:17] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis nupwiki in section s5 [13:37:57] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:38:32] (03PS1) 10Muehlenhoff: Add trixie to pbuilder setup [puppet] - 10https://gerrit.wikimedia.org/r/1139037 (https://phabricator.wikimedia.org/T391083) [13:38:44] bking@cumin2002 rename (PID 706591) is awaiting input [13:39:40] (03CR) 10Ssingh: [C:03+1] "Looks good! Want us to merge it?" 
[dns] - 10https://gerrit.wikimedia.org/r/1138996 (https://phabricator.wikimedia.org/T392669) (owner: 10Hashar) [13:40:53] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:40:57] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:43:23] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis nupwiki in section s5 [13:43:42] !log taavi@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet [13:45:30] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Checking sanitization for wikis nupwiki in section s5 [13:46:02] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis nupwiki in section s5 [13:46:53] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:46:55] PROBLEM - Check if anycast-healthchecker and all configured threads are running on cloudlb2003-dev is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:46:57] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:47:01] PROBLEM - Bird Internet Routing Daemon on cloudlb2003-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:47:01] PROBLEM - Bird Internet Routing Daemon on cloudlb2002-dev is CRITICAL: PROCS 
CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:47:01] PROBLEM - Check if anycast-healthchecker and all configured threads are running on cloudlb2002-dev is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:47:02] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2081 to cirrussearch2081 - bking@cumin2002" [13:47:15] (03CR) 10Arnaudb: [C:03+1] "thanks for the cleanup!" [puppet] - 10https://gerrit.wikimedia.org/r/1139005 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:47:41] (03CR) 10Arnaudb: [C:03+1] gitlab: disable ci_secure_files object storage [puppet] - 10https://gerrit.wikimedia.org/r/1139007 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:47:59] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2081 to cirrussearch2081 - bking@cumin2002" [13:48:00] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:48:01] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2081 [13:48:24] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2081 [13:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:49:05] !log bking@cumin2002 END (PASS) - Cookbook 
sre.hosts.rename (exit_code=0) from elastic2081 to cirrussearch2081 [13:51:06] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2081.codfw.wmnet on all recursors [13:51:10] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2081.codfw.wmnet on all recursors [13:51:16] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis nupwiki in section s5 [13:51:29] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2081.codfw.wmnet with OS bullseye [13:51:41] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2081 [13:51:55] RECOVERY - Check if anycast-healthchecker and all configured threads are running on cloudlb2003-dev is OK: OK: UP (pid=3458277) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:52:01] RECOVERY - Bird Internet Routing Daemon on cloudlb2003-dev is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:52:01] RECOVERY - Bird Internet Routing Daemon on cloudlb2002-dev is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:52:01] RECOVERY - Check if anycast-healthchecker and all configured threads are running on cloudlb2002-dev is OK: OK: UP (pid=592498) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:52:13] PROBLEM - Bird Internet Routing Daemon on cloudlb2004-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:52:17] PROBLEM - haproxy alive on cloudlb2004-dev is CRITICAL: CRITICAL check_alive invalid response https://wikitech.wikimedia.org/wiki/HAProxy [13:52:37] PROBLEM - haproxy process on cloudlb2004-dev is CRITICAL: PROCS CRITICAL: 
0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [13:53:15] PROBLEM - Check if anycast-healthchecker and all configured threads are running on cloudlb2004-dev is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:54:37] RECOVERY - haproxy process on cloudlb2004-dev is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [13:54:44] bking@cumin2002 reimage (PID 724557) is awaiting input [13:55:13] RECOVERY - Bird Internet Routing Daemon on cloudlb2004-dev is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:55:15] RECOVERY - Check if anycast-healthchecker and all configured threads are running on cloudlb2004-dev is OK: OK: UP (pid=16380) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:56:01] PROBLEM - Bird Internet Routing Daemon on cloudlb2003-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:56:01] PROBLEM - Bird Internet Routing Daemon on cloudlb2002-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:56:19] Those bird warnings are me stress-testing things in the test cluster, they will clear in just a moment [13:56:23] PROBLEM - haproxy alive on cloudlb2003-dev is CRITICAL: CRITICAL check_alive invalid response https://wikitech.wikimedia.org/wiki/HAProxy [13:56:37] PROBLEM - haproxy process on cloudlb2002-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [13:56:37] PROBLEM - haproxy process on cloudlb2003-dev is CRITICAL: PROCS CRITICAL: 0 
processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [13:56:55] PROBLEM - Check if anycast-healthchecker and all configured threads are running on cloudlb2003-dev is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:57:01] RECOVERY - Bird Internet Routing Daemon on cloudlb2003-dev is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:57:01] RECOVERY - Bird Internet Routing Daemon on cloudlb2002-dev is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:57:02] you can't silence the BGP alerts but you can downtime the hosts if desired. [13:57:37] RECOVERY - haproxy process on cloudlb2002-dev is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [13:57:37] RECOVERY - haproxy process on cloudlb2003-dev is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [13:57:43] Yeah :( as always what should have been a 5-second test took a little bit of fiddling to get right [13:57:53] sorry about the noise all [13:57:53] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:57:55] RECOVERY - Check if anycast-healthchecker and all configured threads are running on cloudlb2003-dev is OK: OK: UP (pid=3462246) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:57:57] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:58:41] (03CR) 10JHathaway: Rakefile: remove semver-cli requirement (031 
comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138839 (owner: 10Kamila Součková) [13:59:17] RECOVERY - haproxy alive on cloudlb2004-dev is OK: OK check_alive uptime 316s https://wikitech.wikimedia.org/wiki/HAProxy [13:59:18] 10ops-codfw, 06SRE, 06DC-Ops, 10SRE Observability (FY2024/2025-Q4): kafka-logging2005 is down since six days - https://phabricator.wikimedia.org/T392488#10768246 (10Jhancock.wm) I can ssh into the server at this time. [13:59:48] !log restart object-replicator on ms-be2089 [13:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:08] (03PS1) 10Jforrester: Move wgParserMigrationEnableParsoidDiscussionTools and wgParserMigrationEnableParsoidArticlePages to a dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139038 [14:00:08] (03PS1) 10Jforrester: manage-dblist: Default all new wikis to parsoidrendered [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139039 (https://phabricator.wikimedia.org/T376827) [14:01:30] (03PS2) 10Jforrester: Move wgParserMigrationEnableParsoidDiscussionTools and wgParserMigrationEnableParsoidArticlePages to a dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139038 [14:01:30] (03PS2) 10Jforrester: manage-dblist: Default all new wikis to parsoidrendered [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139039 (https://phabricator.wikimedia.org/T376827) [14:02:21] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudlb2001-dev.codfw.wmnet [14:02:25] RECOVERY - haproxy alive on cloudlb2003-dev is OK: OK check_alive uptime 327s https://wikitech.wikimedia.org/wiki/HAProxy [14:03:50] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis nupwiki in section s5 [14:04:35] (03PS1) 10MVernon: swift: add ms-be2089 to profile::swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1139046 (https://phabricator.wikimedia.org/T388221) [14:04:56] (03PS1) 10Jforrester: nupwiki: 
Enable Parsoid mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139047 (https://phabricator.wikimedia.org/T390384) [14:05:34] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:05:56] andrew@cumin1002 decommission (PID 2698778) is awaiting input [14:06:03] PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:07:45] 10ops-codfw, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): cirrussearch2078 (R440 Config D, Row/Rack B2) unable to PXE boot - https://phabricator.wikimedia.org/T392644#10768282 (10bking) a:05bking→03Papaul [14:08:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:08:44] (03CR) 10Ladsgroup: [C:03+1] swift: add ms-be2089 to profile::swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1139046 (https://phabricator.wikimedia.org/T388221) (owner: 10MVernon) [14:08:49] (03CR) 10MVernon: [C:03+1] "LGTM. If you've not yet done so, you'll need to grant the gitlab-ro account read-only access to the relevant buckets." 
[puppet] - 10https://gerrit.wikimedia.org/r/1139005 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [14:09:02] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis nupwiki in section s5 [14:09:03] (03CR) 10MVernon: [C:03+2] swift: add ms-be2089 to profile::swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1139046 (https://phabricator.wikimedia.org/T388221) (owner: 10MVernon) [14:09:36] (03CR) 10Subramanya Sastry: "We need an exclusion for wikisource wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139039 (https://phabricator.wikimedia.org/T376827) (owner: 10Jforrester) [14:10:08] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2081 - bking@cumin2002" [14:10:39] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2081 - bking@cumin2002" [14:10:39] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:10:40] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2081.codfw.wmnet 86.32.192.10.in-addr.arpa 6.8.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:10:43] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2081.codfw.wmnet 86.32.192.10.in-addr.arpa 6.8.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:10:44] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2081 [14:11:11] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [14:11:15] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2081 [14:11:15] !log bking@cumin2002 END (PASS) - 
Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2081 [14:13:51] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:13:51] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudlb2001-dev.codfw.wmnet [14:14:33] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10768322 (10MatthewVernon) This is good to know, thanks :) [14:16:08] (03CR) 10Subramanya Sastry: [C:03+1] Move wgParserMigrationEnableParsoidDiscussionTools and wgParserMigrationEnableParsoidArticlePages to a dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139038 (owner: 10Jforrester) [14:16:16] (03CR) 10Subramanya Sastry: [C:04-1] manage-dblist: Default all new wikis to parsoidrendered [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139039 (https://phabricator.wikimedia.org/T376827) (owner: 10Jforrester) [14:16:36] (03PS1) 10Andrew Bogott: Remove refs to cloudlb2001-dev. [puppet] - 10https://gerrit.wikimedia.org/r/1139052 (https://phabricator.wikimedia.org/T392686) [14:20:15] (03CR) 10Andrew Bogott: [C:03+2] Remove refs to cloudlb2001-dev. 
[puppet] - 10https://gerrit.wikimedia.org/r/1139052 (https://phabricator.wikimedia.org/T392686) (owner: 10Andrew Bogott) [14:20:19] (03PS2) 10Federico Ceratto: sre.mysql.sanitize-wiki - handle multiple hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1139035 (https://phabricator.wikimedia.org/T366146) [14:23:22] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis nupwiki in section s5 [14:23:26] (03PS1) 10Kamila Součková: CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1139056 (https://phabricator.wikimedia.org/T385867) [14:25:21] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2081.codfw.wmnet with reason: host reimage [14:26:24] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Checking sanitization for wikis nupwiki in section s5 [14:29:09] RECOVERY - OpenSearch health check for shards on 9600 on cirrussearch2076 is OK: OK - elasticsearch status production-search-psi-codfw: cluster_name: production-search-psi-codfw, status: green, timed_out: False, number_of_nodes: 29, number_of_data_nodes: 29, discovered_master: True, active_primary_shards: 1678, active_shards: 5033, relocating_shards: 1, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_ [14:29:09] tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:29:10] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2081.codfw.wmnet with reason: host reimage [14:29:13] RECOVERY - OpenSearch health check for shards on 9600 on cirrussearch2080 is OK: OK - elasticsearch status production-search-psi-codfw: cluster_name: production-search-psi-codfw, status: green, timed_out: False, number_of_nodes: 29, number_of_data_nodes: 29, 
discovered_master: True, active_primary_shards: 1678, active_shards: 5033, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_ [14:29:13] tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:29:13] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch2080 is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 57, number_of_data_nodes: 57, discovered_master: True, active_primary_shards: 1356, active_shards: 4180, relocating_shards: 0, initializing_shards: 9, unassigned_shards: 5, delayed_unassigned_shards: 0, number_of_pending [14:29:13] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.66618979494515 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:29:34] 10ops-codfw, 06cloud-services-team, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission cloudlb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T392686#10768340 (10Andrew) a:05Andrew→03None [14:31:03] RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:33:10] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1139056 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [14:33:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2076-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:34:09] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch2076 is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 58, number_of_data_nodes: 58, discovered_master: True, active_primary_shards: 1356, active_shards: 4181, relocating_shards: 0, initializing_shards: 9, unassigned_shards: 4, delayed_unassigned_shards: 0, number_of_pending [14:34:09] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.6900333810205 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:34:27] FIRING: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2076:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:36:28] (03CR) 10Kamila Součková: [C:03+1] "I'm asking because this failure is different as the other change. But yes, CI is on bullseye, so LGTM. Thanks for checking." 
[debs/helm3] - 10https://gerrit.wikimedia.org/r/1137010 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [14:38:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:39:27] RESOLVED: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2076:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:39:28] (03CR) 10Anzx: "T390711 should be right task this patch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139047 (https://phabricator.wikimedia.org/T390384) (owner: 10Jforrester) [14:43:41] PROBLEM - Disk space on analytics1073 is CRITICAL: DISK CRITICAL - free space: / 2125 MB (3% inode=95%): /tmp 2125 MB (3% inode=95%): /var/tmp 2125 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1073&var-datasource=eqiad+prometheus/ops [14:46:03] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2081.codfw.wmnet with OS bullseye [14:49:27] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:49:59] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:49:59] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138810 (https://phabricator.wikimedia.org/T388539) (owner: 10Hnowlan) [14:50:17] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 
8922 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:50:49] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53800 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:52:29] 10ops-codfw, 06SRE, 06DC-Ops, 10SRE Observability (FY2024/2025-Q4): kafka-logging2005 is down since six days - https://phabricator.wikimedia.org/T392488#10768411 (10Jhancock.wm) 05Open→03Resolved [14:55:16] (03PS2) 10Hnowlan: mw::maintenance: migrate deleteExpiredUserImpactData to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1136770 (https://phabricator.wikimedia.org/T385782) [15:01:22] (03CR) 10Vgutierrez: [C:03+1] "let's move forward with this on Monday and sorry about the delay" [puppet] - 10https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [15:03:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2076-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:03:49] (03CR) 10Vgutierrez: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) (owner: 10Fabfur) [15:06:18] (03PS3) 10Federico Ceratto: sre.mysql.sanitize-wiki - handle multiple hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1139035 
(https://phabricator.wikimedia.org/T366146) [15:06:24] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis nupwiki in section s5 [15:07:04] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Checking sanitization for wikis nupwiki in section s5 [15:07:12] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis nupwiki in section s5 [15:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:40] !log dancy@deploy1003 Installing scap version "4.156.0" for 2 host(s) [15:10:10] !log dancy@deploy1003 Cancelled [15:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:56] fceratto@cumin1002 sanitize-wiki (PID 2763596) is awaiting input [15:12:34] (03CR) 10Jforrester: "No, see the parents; this should be a in-creation step, not a post-creation step." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139047 (https://phabricator.wikimedia.org/T390384) (owner: 10Jforrester) [15:13:07] 10ops-codfw, 06SRE, 06DC-Ops, 10SRE Observability (FY2024/2025-Q4): kafka-logging2005 is down since six days - https://phabricator.wikimedia.org/T392488#10768486 (10herron) Can confirm topics have rebalanced as well. Thanks! 
[15:14:38] (03PS3) 10Jforrester: manage-dblist: Default all new wikis to parsoidrendered [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139039 (https://phabricator.wikimedia.org/T376827) [15:14:38] (03CR) 10Jforrester: manage-dblist: Default all new wikis to parsoidrendered (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139039 (https://phabricator.wikimedia.org/T376827) (owner: 10Jforrester) [15:14:38] (03PS2) 10Jforrester: nupwiki: Enable Parsoid mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139047 (https://phabricator.wikimedia.org/T390384) [15:16:09] (03PS1) 10Kamila Součková: GlobalBlocking: Migrate fixGlobalBlockWhitelist [puppet] - 10https://gerrit.wikimedia.org/r/1139078 (https://phabricator.wikimedia.org/T388542) [15:16:34] (03CR) 10CI reject: [V:04-1] GlobalBlocking: Migrate fixGlobalBlockWhitelist [puppet] - 10https://gerrit.wikimedia.org/r/1139078 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [15:18:58] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis nupwiki in section s5 [15:19:39] (03PS2) 10Kamila Součková: GlobalBlocking: Migrate fixGlobalBlockWhitelist [puppet] - 10https://gerrit.wikimedia.org/r/1139078 (https://phabricator.wikimedia.org/T388542) [15:19:41] (03PS1) 10Hnowlan: mw:maintenance: migrate mediamoderation-updateMetrics to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139080 (https://phabricator.wikimedia.org/T385799) [15:20:09] (03CR) 10CI reject: [V:04-1] mw:maintenance: migrate mediamoderation-updateMetrics to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139080 (https://phabricator.wikimedia.org/T385799) (owner: 10Hnowlan) [15:20:59] (03CR) 10Hnowlan: [C:03+1] GlobalBlocking: Migrate fixGlobalBlockWhitelist [puppet] - 10https://gerrit.wikimedia.org/r/1139078 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [15:21:35] (03CR) 10Kamila Součková: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1139078 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [15:22:45] (03PS2) 10Hnowlan: mw:maintenance: migrate mediamoderation-updateMetrics to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139080 (https://phabricator.wikimedia.org/T385799) [15:23:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 3.096% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:23:41] FIRING: [7x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:24:57] (03CR) 10Kamila Součková: [C:03+1] mw:maintenance: migrate mediamoderation-updateMetrics to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139080 (https://phabricator.wikimedia.org/T385799) (owner: 10Hnowlan) [15:25:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:26:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.485s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:26:28] (03CR) 10Eevans: [C:03+2] restbase: configure restbase104[3-5] for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1138838 (https://phabricator.wikimedia.org/T389423) (owner: 10Eevans) 
[15:27:12] !log dancy@deploy1003 Installing scap version "4.157.0" for 2 host(s) [15:28:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 3.598% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:28:41] FIRING: [7x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:29:00] !log dancy@deploy1003 Installation of scap version "4.157.0" completed for 2 hosts [15:30:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:31:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 4.649s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:37:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:35] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye [15:38:39] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2078 [15:38:40] !log bking@cumin2002 END (PASS) - Cookbook 
sre.hosts.move-vlan (exit_code=0) for host cirrussearch2078 [15:43:31] FIRING: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:45:02] (03CR) 10MVernon: [C:03+1] adjust hosts lists to reflect changes in restbase cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138854 (https://phabricator.wikimedia.org/T389423) (owner: 10Eevans) [15:48:31] RESOLVED: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:49:00] (03CR) 10RLazarus: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1138951 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [15:49:31] FIRING: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:52:19] !log dancy@deploy1003 Installing scap version "4.157.1" for 2 host(s) [15:53:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:46] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:54:06] !log dancy@deploy1003 Installation of scap version "4.157.1" completed for 2 hosts [15:55:00] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2083 to cirrussearch2083 [15:55:11] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:55:43] (03CR) 10Krinkle: admin: Remove platform-engineering group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1138828 (owner: 10Krinkle) [15:58:55] (03CR) 10Krinkle: admin: Remove platform-engineering group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1138828 (owner: 10Krinkle) [15:58:55] (03PS3) 10Krinkle: admin: Remove platform-engineering group [puppet] - 10https://gerrit.wikimedia.org/r/1138828 [15:59:21] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2083 to cirrussearch2083 - bking@cumin2002" 
[15:59:31] RESOLVED: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:59:39] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2083 to cirrussearch2083 - bking@cumin2002" [15:59:39] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:59:40] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2083 [15:59:54] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2083 [15:59:55] (03CR) 10Krinkle: [C:03+1] NetworkProbeLimit: use SameSite=None [puppet] - 10https://gerrit.wikimedia.org/r/1138836 (https://phabricator.wikimedia.org/T342624) (owner: 10CDanis) [16:00:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2083 to cirrussearch2083 [16:02:48] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2083.codfw.wmnet on all recursors [16:02:52] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2083.codfw.wmnet on all recursors [16:03:13] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2083.codfw.wmnet with OS bullseye [16:03:24] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2083 [16:03:31] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:07:45] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2083 - bking@cumin2002" [16:07:51] !log bking@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2083 - bking@cumin2002" [16:07:51] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:07:51] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2083.codfw.wmnet 88.32.192.10.in-addr.arpa 8.8.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:07:55] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2083.codfw.wmnet 88.32.192.10.in-addr.arpa 8.8.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:07:56] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2083 [16:08:07] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2083 [16:08:07] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2083 [16:18:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:22:28] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2083.codfw.wmnet with reason: host reimage [16:24:24] (03PS1) 10Brouberol: an-launcher: disable gobblin webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1139093 (https://phabricator.wikimedia.org/T390249) [16:24:24] (03PS1) 10Brouberol: an-launcher: disable gobblin webrequest_frontend [puppet] - 10https://gerrit.wikimedia.org/r/1139094 (https://phabricator.wikimedia.org/T390249) [16:24:26] (03PS1) 10Brouberol: an-launcher: disable gobblin netflow [puppet] - 
10https://gerrit.wikimedia.org/r/1139095 (https://phabricator.wikimedia.org/T390249) [16:24:28] (03PS1) 10Brouberol: an-launcher: disable gobblin event_default [puppet] - 10https://gerrit.wikimedia.org/r/1139096 (https://phabricator.wikimedia.org/T390249) [16:24:30] (03PS1) 10Brouberol: an-launcher: disable gobblin eventlogin_leacy [puppet] - 10https://gerrit.wikimedia.org/r/1139097 (https://phabricator.wikimedia.org/T390249) [16:26:03] (03CR) 10Aleksandar Mastilovic: [V:03+1 C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1139093 (https://phabricator.wikimedia.org/T390249) (owner: 10Brouberol) [16:26:10] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2083.codfw.wmnet with reason: host reimage [16:26:15] (03CR) 10Aleksandar Mastilovic: [V:03+1 C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1139094 (https://phabricator.wikimedia.org/T390249) (owner: 10Brouberol) [16:26:36] (03CR) 10Aleksandar Mastilovic: [V:03+1 C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1139095 (https://phabricator.wikimedia.org/T390249) (owner: 10Brouberol) [16:26:49] (03CR) 10Aleksandar Mastilovic: [V:03+1 C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1139096 (https://phabricator.wikimedia.org/T390249) (owner: 10Brouberol) [16:27:05] (03CR) 10Aleksandar Mastilovic: [V:03+1 C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1139097 (https://phabricator.wikimedia.org/T390249) (owner: 10Brouberol) [16:27:09] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis nupwiki in section s5 [16:31:15] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis nupwiki in section s5 [16:31:31] FIRING: ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:33:57] 07Puppet: sync-puppet-ca timer broken - https://phabricator.wikimedia.org/T392628#10768814 (10jhathaway) The timer file is now valid, but it is still not firing on puppetserver2002.codfw.wmnet: ` $ sudo systemctl status sync-puppet-ca.timer ● sync-puppet-ca.timer - Periodic execution of sync-puppet-ca.service... [16:36:19] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:36:31] FIRING: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:37:19] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:38:15] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:38:15] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:41:15] (03PS4) 10Federico Ceratto: sre.mysql.sanitize-wiki - handle multiple hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1139035 (https://phabricator.wikimedia.org/T366146) [16:41:31] RESOLVED: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:41:48] 07Puppet: puppetserver: CRL updates not picked up by compile only hosts - https://phabricator.wikimedia.org/T392709 (10jhathaway) 03NEW [16:42:03] 07Puppet: puppetserver: CRL updates not picked up by compile only hosts - https://phabricator.wikimedia.org/T392709#10768870 (10jhathaway) p:05Triage→03Medium [16:44:05] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis nupwiki in section s5 [16:45:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2083.codfw.wmnet with OS bullseye [16:46:48] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:47:29] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis nupwiki in section s5 [16:49:25] FIRING: [4x] SystemdUnitFailed: backup-kdc-database.service on krb1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:56:06] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis nupwiki in section s5 [16:56:07] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis nupwiki in section s5 [16:58:58] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2078.codfw.wmnet with OS bullseye [17:03:11] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2084 to cirrussearch2084 [17:03:23] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:07:41] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2084 to cirrussearch2084 - bking@cumin2002" [17:08:04] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2084 to cirrussearch2084 - bking@cumin2002" [17:08:04] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:08:05] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2084 [17:09:21] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:09:21] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:10:17] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:10:17] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:10:31] FIRING: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:11:08] bking@cumin2002 rename (PID 929524) is awaiting input [17:12:31] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2084 [17:13:11] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2084 to cirrussearch2084 [17:13:48] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2084.codfw.wmnet on all recursors [17:13:52] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2084.codfw.wmnet on all recursors [17:14:11] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2084.codfw.wmnet with OS bullseye [17:14:22] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2084 [17:14:48] inflatador: ^ so yeah, this awaiting input thing is quite handy IMO [17:15:01] and while it doesn't solve all problems, I hope it solves some of the things we talked about re: human input [17:17:21] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP 
: 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:17:25] bking@cumin2002 reimage (PID 940120) is awaiting input [17:18:17] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:20:31] RESOLVED: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:30:45] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10769015 (10ArthurPSmith) Problem is still there - I just created P13478 (first item-v... [17:38:34] (03CR) 10Brouberol: [C:03+2] an-launcher: disable gobblin webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1139093 (https://phabricator.wikimedia.org/T390249) (owner: 10Brouberol) [17:39:36] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): cirrussearch2078 (R440 Config D, Row/Rack B2) unable to PXE boot - https://phabricator.wikimedia.org/T392644#10769067 (10Papaul) @bking can we resolve this now? [17:40:18] Is it just me or is gerrit running slowly? [17:41:31] FIRING: ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:46:23] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:46:25] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:46:31] RESOLVED: ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:48:31] FIRING: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:48:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:49:17] (03PS6) 10Xcollazo: Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) [17:49:21] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:49:21] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:49:26] (03CR) 10Xcollazo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: 10Xcollazo) [17:50:24] (03CR) 10Brouberol: [C:03+2] 
an-launcher: disable gobblin webrequest_frontend [puppet] - 10https://gerrit.wikimedia.org/r/1139094 (https://phabricator.wikimedia.org/T390249) (owner: 10Brouberol) [17:51:39] (03CR) 10Brouberol: [C:03+1] Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: 10Xcollazo) [17:51:41] (03CR) 10Brouberol: [C:03+2] Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: 10Xcollazo) [17:53:31] RESOLVED: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:55:31] FIRING: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:00:28] (03CR) 10Brouberol: [C:03+2] an-launcher: disable gobblin netflow [puppet] - 10https://gerrit.wikimedia.org/r/1139095 (https://phabricator.wikimedia.org/T390249) (owner: 10Brouberol) [18:00:30] (03CR) 10Brouberol: [C:03+2] an-launcher: disable gobblin event_default [puppet] - 10https://gerrit.wikimedia.org/r/1139096 (https://phabricator.wikimedia.org/T390249) (owner: 10Brouberol) [18:00:31] RESOLVED: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:00:33] 
(03CR) 10Brouberol: [C:03+2] an-launcher: disable gobblin eventlogin_leacy [puppet] - 10https://gerrit.wikimedia.org/r/1139097 (https://phabricator.wikimedia.org/T390249) (owner: 10Brouberol) [18:08:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:08:46] !log bking@cumin2002 START - Cookbook sre.dns.netbox [18:11:48] (03CR) 10Subramanya Sastry: [C:03+1] "Sounds good. Once we have proofreadpage working with parsoid, we can take this out." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139039 (https://phabricator.wikimedia.org/T376827) (owner: 10Jforrester) [18:12:59] (03CR) 10Subramanya Sastry: "and rkiwiki?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139047 (https://phabricator.wikimedia.org/T390384) (owner: 10Jforrester) [18:13:21] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2084 - bking@cumin2002" [18:13:27] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2084 - bking@cumin2002" [18:13:27] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:13:28] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2084.codfw.wmnet 56.48.192.10.in-addr.arpa 6.5.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [18:13:31] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2084.codfw.wmnet 56.48.192.10.in-addr.arpa 6.5.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all 
recursors [18:13:32] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2084 [18:14:00] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): cirrussearch2078 (R440 Config D, Row/Rack B2) unable to PXE boot - https://phabricator.wikimedia.org/T392644#10769144 (10bking) @Papaul Unfortunately, the reimage failed again. I'm here for a few more hours and will keep trying. I... [18:14:04] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2084 [18:14:04] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2084 [18:16:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136132 (https://phabricator.wikimedia.org/T142313) (owner: 10Gergő Tisza) [18:16:56] (03CR) 10Jforrester: "Per `git log --topo-order --oneline --since 2024-11-01 dblists/all.dblist` the new wikis are:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139047 (https://phabricator.wikimedia.org/T390384) (owner: 10Jforrester) [18:17:48] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): cirrussearch2078 (R440 Config D, Row/Rack B2) unable to PXE boot - https://phabricator.wikimedia.org/T392644#10769147 (10bking) [18:18:32] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye [18:18:36] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2078 [18:18:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2078 [18:23:21] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2078.codfw.wmnet with OS bullseye [18:23:41] FIRING: [3x] 
CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [18:24:38] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye [18:24:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:28:29] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2084.codfw.wmnet with reason: host reimage [18:29:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:30:21] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host relforge1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:31:51] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2084.codfw.wmnet with reason: host reimage [18:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:41:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host relforge1010.mgmt.eqiad.wmnet with 
chassis set policy FORCE_RESTART [18:43:03] (03CR) 10Subramanya Sastry: [C:03+1] nupwiki: Enable Parsoid mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139047 (https://phabricator.wikimedia.org/T390384) (owner: 10Jforrester) [18:43:14] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host relforge1010.eqiad.wmnet with OS bullseye [18:43:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10769254 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host relforge101... [18:57:18] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2084.codfw.wmnet with OS bullseye [18:57:31] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host relforge1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:03:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host relforge1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:09:40] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1010.eqiad.wmnet with reason: host reimage [19:11:51] (03PS1) 10Mforns: Add file and filetypes tables to the mediawiki-not-history sqoop [puppet] - 10https://gerrit.wikimedia.org/r/1139115 (https://phabricator.wikimedia.org/T389800) [19:12:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1010.eqiad.wmnet with reason: host reimage [19:15:37] !log jhancock@cumin2002 START - 
Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:21:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:26:24] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [19:27:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [19:27:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1010.eqiad.wmnet with OS bullseye [19:28:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10769354 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host relforge1010.eq... [19:28:41] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:29:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10769355 (10Jclark-ctr) [19:30:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10769357 (10Jclark-ctr) 05Open→03Resolved @bking thanks for patience with this one. 
just wrapped up final server [19:32:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:35:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:39:20] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2078.codfw.wmnet with OS bullseye [19:53:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:53:46] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:53:54] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:00:29] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:00:29] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:01:25] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:01:25] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:03:05] (03PS1) 10JHathaway: puppetserver: update sync-puppet-ca timer [puppet] - 10https://gerrit.wikimedia.org/r/1139120 (https://phabricator.wikimedia.org/T392628) [20:05:00] (03PS2) 10JHathaway: puppetserver: update sync-puppet-ca timer [puppet] - 10https://gerrit.wikimedia.org/r/1139120 (https://phabricator.wikimedia.org/T392628) [20:07:53] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1139120 (https://phabricator.wikimedia.org/T392628) (owner: 10JHathaway) [20:08:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:11:49] jhancock@cumin2002 provision (PID 1118955) is awaiting input [20:25:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:32:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2045.codfw.wmnet with OS bookworm [20:32:14] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10769451 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm [20:32:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:34:34] !log bking@cumin2002 START - 
Cookbook sre.hosts.rename from elastic2086 to cirrussearch2086 [20:34:57] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:35:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:35:44] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10769459 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm executed with err... [20:37:04] (03PS2) 10Bernard Wang: Remove Search AB test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138859 (https://phabricator.wikimedia.org/T388719) [20:37:16] (03CR) 10Bernard Wang: Remove Search AB test config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138859 (https://phabricator.wikimedia.org/T388719) (owner: 10Bernard Wang) [20:38:37] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye [20:40:37] bking@cumin2002 rename (PID 1145505) is awaiting input [20:45:46] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): cirrussearch2078 (R440 Config D, Row/Rack B2) unable to PXE boot - https://phabricator.wikimedia.org/T392644#10769498 (10bking) [20:46:48] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:47:05] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2078.codfw.wmnet with OS bullseye [20:49:25] FIRING: [4x] SystemdUnitFailed: backup-kdc-database.service on krb1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:49:38] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): cirrussearch2078 (R440 Config D, Row/Rack B2) unable to PXE boot - https://phabricator.wikimedia.org/T392644#10769524 (10bking) I've tried again a few times with `sudo cookbook sre.hosts.reimage --new --os bullseye cirrussearch2... [20:50:51] (03CR) 10BCornwall: [C:03+1] P:durum: add conditional to enable ECH (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1138823 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [20:52:15] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2086 to cirrussearch2086 - bking@cumin2002" [20:53:32] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2086 to cirrussearch2086 - bking@cumin2002" [20:53:32] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:53:33] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2086 [20:53:43] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2086 [20:54:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2086 to cirrussearch2086 [20:57:25] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2086.codfw.wmnet with OS bullseye [20:58:15] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2086.codfw.wmnet with OS bullseye [21:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 
- https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:05:31] FIRING: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:06:27] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:10:31] RESOLVED: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:12:16] (03PS4) 10JHathaway: systemd: validate units [puppet] - 10https://gerrit.wikimedia.org/r/1138905 (https://phabricator.wikimedia.org/T392629) [21:12:27] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138905 (https://phabricator.wikimedia.org/T392629) (owner: 10JHathaway) [21:17:26] (03CR) 10BCornwall: [C:03+1] gerrit: add google-site-verification [dns] - 10https://gerrit.wikimedia.org/r/1138996 (https://phabricator.wikimedia.org/T392669) (owner: 10Hashar) [21:18:07] (03CR) 10Kimberly Sarabia: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138859 (https://phabricator.wikimedia.org/T388719) (owner: 10Bernard Wang) [21:18:29] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:18:29] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces 
vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:18:44] (03CR) 10BCornwall: [C:03+1] Add include statement for WMCS service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1139033 (https://phabricator.wikimedia.org/T379282) (owner: 10Cathal Mooney) [21:19:25] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:19:27] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:19:59] (03CR) 10Dzahn: "I like the idea to make comments searchable but I am wondering about access to the Google search console itself.. given the long history w" [dns] - 10https://gerrit.wikimedia.org/r/1138996 (https://phabricator.wikimedia.org/T392669) (owner: 10Hashar) [21:40:31] FIRING: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:44:38] we are aware of the intermittent gerrit alerts [21:45:31] RESOLVED: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:53:44] !log bking@cumin2002 START - Cookbook 
sre.hosts.reimage for host cirrussearch2086.codfw.wmnet with OS bullseye [21:53:56] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2086 [21:56:21] !log bking@cumin2002 START - Cookbook sre.dns.netbox [22:02:00] bking@cumin2002 reimage (PID 1224546) is awaiting input [22:05:31] FIRING: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:10:31] RESOLVED: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:23:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [22:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:57:27] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas 
anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:07:33] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:09:29] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:28:41] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:41:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1139141 [23:41:28] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1139141 (owner: 10TrainBranchBot) [23:41:31] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:42:33] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:53:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:53:49] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:59:43] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1139141 (owner: 10TrainBranchBot)