[00:00:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:02:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch2079 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [00:02:27] PROBLEM - OpenSearch health check for shards on 9600 on cirrussearch2079 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [00:03:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2079:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [00:05:09] PROBLEM - Disk space on an-worker1132 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 217195 MB (5% inode=99%): /var/lib/hadoop/data/h 218701 MB (5% inode=99%): /var/lib/hadoop/data/l 235697 MB (6% inode=99%): /var/lib/hadoop/data/b 227429 MB (6% inode=99%): /var/lib/hadoop/data/j 194350 MB (5% inode=99%): /var/lib/hadoop/data/g 179466 MB (4% inode=99%): /var/lib/hadoop/data/e 221498 MB (5% inode=99%): /var/lib/hadoop/data [00:05:09] 9 MB (6% inode=99%): /var/lib/hadoop/data/d 231553 MB (6% inode=99%): /var/lib/hadoop/data/f 185759 MB (4% inode=99%): /var/lib/hadoop/data/i 249410 MB (6% inode=99%): /var/lib/hadoop/data/k 143061 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1132&var-datasource=eqiad+prometheus/ops [00:06:25] 06SRE-OnFire, 10Cassandra, 06MediaWiki-Platform-Team, 10Sustainability (Incident Followup): sessionstorage namespacing - https://phabricator.wikimedia.org/T392170 (10Eevans) 03NEW [00:06:38] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2079-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [00:07:42] (03CR) 10Dzahn: [C:03+2] spiderpig: Set global_cert_name on deployment-deploy04 [puppet] - 10https://gerrit.wikimedia.org/r/1137030 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [00:10:45] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 639.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:10:58] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1137091 [00:10:58] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1137091 (owner: 10TrainBranchBot) [00:12:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P75176 and previous config saved to /var/cache/conftool/dbconfig/20250417-001235-fceratto.json [00:21:20] (03CR) 10Scott French: [C:03+1] "Thanks for updating this, @brouberol@wikimedia.org - LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis) [00:27:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T391056)', diff saved to https://phabricator.wikimedia.org/P75177 and previous config saved to /var/cache/conftool/dbconfig/20250417-002743-fceratto.json [00:27:47] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [00:32:02] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1137091 (owner: 10TrainBranchBot) [00:34:48] (03CR) 10Scott French: [C:03+1] "Nice! I'm definitely a fan of uniformity for extremely common uses like this." [cookbooks] - 10https://gerrit.wikimedia.org/r/1136839 (owner: 10Volans) [00:48:50] (03CR) 10Jforrester: [C:03+1] "<3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) (owner: 10BryanDavis) [00:51:09] PROBLEM - Disk space on an-worker1089 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 137988 MB (3% inode=99%): /var/lib/hadoop/data/m 135955 MB (3% inode=99%): /var/lib/hadoop/data/f 239828 MB (6% inode=99%): /var/lib/hadoop/data/c 223766 MB (5% inode=99%): /var/lib/hadoop/data/e 203896 MB (5% inode=99%): /var/lib/hadoop/data/g 247022 MB (6% inode=99%): /var/lib/hadoop/data/j 255801 MB (6% inode=99%): /var/lib/hadoop/data [00:51:09] 4 MB (3% inode=99%): /var/lib/hadoop/data/d 248193 MB (6% inode=99%): /var/lib/hadoop/data/b 161115 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1089&var-datasource=eqiad+prometheus/ops [00:55:03] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2091.codfw.wmnet with OS bullseye [00:58:39] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on 
apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:59:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [01:29:05] 06SRE-OnFire, 10Cassandra, 06MediaWiki-Platform-Team, 10Sustainability (Incident Followup): sessionstore workload observability - https://phabricator.wikimedia.org/T392182 (10Eevans) 03NEW [01:32:26] 06SRE-OnFire, 10Cassandra, 06MediaWiki-Platform-Team, 10Sustainability (Incident Followup): sessionstore workload observability - https://phabricator.wikimedia.org/T392182#10750740 (10Eevans) [01:44:53] RECOVERY - Disk space on an-worker1163 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1163&var-datasource=eqiad+prometheus/ops [01:45:09] RECOVERY - Disk space on an-worker1132 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1132&var-datasource=eqiad+prometheus/ops [01:45:47] RECOVERY - Disk space on an-worker1114 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1114&var-datasource=eqiad+prometheus/ops [01:49:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [01:53:37] RECOVERY - Disk space on an-worker1088 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1088&var-datasource=eqiad+prometheus/ops [02:03:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [02:11:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:14:08] (03CR) 10Ryan Kemper: [C:03+2] cirrussearch: Add newly-reimaged hosts to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1137086 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [02:16:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:17:41] RESOLVED: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [02:19:47] (03PS1) 10Ryan Kemper: cirrussearch: fix some omega vs psi assignments [puppet] - 10https://gerrit.wikimedia.org/r/1137093 [02:21:44] RESOLVED: 
[2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:21:57] (03PS2) 10Ryan Kemper: cirrussearch: fix some omega vs psi assignments [puppet] - 10https://gerrit.wikimedia.org/r/1137093 (https://phabricator.wikimedia.org/T388610) [02:22:03] !log [samtar@mwmaint1002 ~]$ mwscript maintenance/cleanupTitles.php --wiki=shwiktionary # `Razgovor:Vikirečnik:Srpskohrvatski` [02:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:22:42] (03CR) 10Ryan Kemper: [C:03+2] cirrussearch: fix some omega vs psi assignments [puppet] - 10https://gerrit.wikimedia.org/r/1137093 (https://phabricator.wikimedia.org/T388610) (owner: 10Ryan Kemper) [02:22:45] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:34:22] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye [02:34:27] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2091 [02:34:27] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2091 [02:34:35] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10750786 (10phaultfinder) [02:36:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:47:30] 
(03PS6) 10Ryan Kemper: sre.elasticsearch.rolling-operation: restart ferm after host rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1137045 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [02:48:05] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610 [02:48:09] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [03:20:25] FIRING: SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:46:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:53:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:54:06] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610 [03:54:10] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [03:54:51] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2091.codfw.wmnet with OS bullseye [04:03:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service 
crashloop on cirrussearch2079:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [04:06:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2079-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [04:07:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:12:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:36:31] FIRING: Traffic bill over quota: Alert for device cr4-ulsfo.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [04:50:03] (03PS3) 10RLazarus: helmfile_namespaces.yaml: Replace deprecated .Environment.Values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127085 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [04:50:05] (03PS1) 10Arnaudb: ldap: gerrit admin rights [puppet] - 10https://gerrit.wikimedia.org/r/1137102 (https://phabricator.wikimedia.org/T392186) [04:50:05] (03CR) 10Arnaudb: "this is to have a consensus to my group membership" [puppet] - 10https://gerrit.wikimedia.org/r/1137102 
(https://phabricator.wikimedia.org/T392186) (owner: Arnaudb) [04:54:12] (PS3) RLazarus: helmfile_namespaces: Merge hiera services with admin_ng namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1127086 (https://phabricator.wikimedia.org/T378429) (owner: JMeybohm) [04:55:30] (CR) RLazarus: helmfile_namespaces: Merge hiera services with admin_ng namespaces (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1127086 (https://phabricator.wikimedia.org/T378429) (owner: JMeybohm) [04:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:56:31] RESOLVED: Traffic bill over quota: Alert for device cr4-ulsfo.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [04:58:15] (PS4) RLazarus: Update admin_ng fixtures to reflect puppet changes [deployment-charts] - https://gerrit.wikimedia.org/r/1127087 (https://phabricator.wikimedia.org/T378429) (owner: JMeybohm) [04:58:39] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:03:04] !log arnaudb@cumin1002 START - Cookbook sre.gerrit.failover from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org [05:03:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.gerrit.failover (exit_code=0) from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:27:34] !log arnaudb@cumin1002 START - Cookbook sre.gerrit.failover from gerrit2002.wikimedia.org to gerrit2003.wikimedia.org [05:27:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.gerrit.failover (exit_code=0) from gerrit2002.wikimedia.org to gerrit2003.wikimedia.org [05:28:56] !log arnaudb@cumin1002 START - Cookbook sre.gerrit.failover from gerrit2002.wikimedia.org to gerrit2003.wikimedia.org [05:29:06] (those are tests, sorry for the spam) [05:30:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.gerrit.failover (exit_code=0) from gerrit2002.wikimedia.org to gerrit2003.wikimedia.org [05:34:41] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10750875 (phaultfinder) [05:36:03] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 113, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:36:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:37:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:50:41] SRE, collaboration-services, Gerrit, LDAP-Access-Requests, Patch-For-Review: Grant Gerrit admin to arnaudb - https://phabricator.wikimedia.org/T392186#10750891 (ABran-WMF) [05:50:44] RESOLVED: [2x] 
RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:51:11] (PS1) Arnaudb: gerrit: switchover to gerrit2002 [puppet] - https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) [05:51:35] (PS1) Arnaudb: gerrit: switchover to gerrit1003 [dns] - https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250417T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250417T0600). 
[06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:02:03] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:03:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [06:35:25] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:39:35] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: Btullis) [06:39:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:48:13] (CR) Brouberol: [C:+1] "Ping me when you want to apply!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1137067 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [06:58:03] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:58:03] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:05] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250417T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:02:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:17:35] (03PS1) 10Brouberol: admin_ng: add the resourcequotas metrics collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137189 (https://phabricator.wikimedia.org/T392193) [07:35:14] (03CR) 10Alexandros Kosiaris: [C:03+1] admin_ng: add the resourcequotas metrics collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137189 (https://phabricator.wikimedia.org/T392193) (owner: 10Brouberol) [07:35:38] (03CR) 10Volans: [C:03+2] doc: expand logging documentation [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136984 (owner: 10Volans) [07:35:46] (03CR) 10Volans: [C:03+2] spicerack: enable IRC notification on user input [puppet] - 10https://gerrit.wikimedia.org/r/1136973 (owner: 10Volans) [07:39:33] (03CR) 10Fabfur: [C:03+2] wmflib: add PATCH method to the 
Wmflib::HTTP::Method list [puppet] - 10https://gerrit.wikimedia.org/r/1137035 (https://phabricator.wikimedia.org/T392096) (owner: 10Fabfur) [07:40:40] (03CR) 10Brouberol: [C:03+2] admin_ng: add the resourcequotas metrics collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137189 (https://phabricator.wikimedia.org/T392193) (owner: 10Brouberol) [07:41:14] !log brouberol@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [07:41:22] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [07:41:41] (03CR) 10Alexandros Kosiaris: [C:04-1] "Finally circling back to this, I now see the light. You are right. However, I think that fully commenting them out is probably not what we" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135402 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [07:41:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:44:21] (03CR) 10Jelto: "looking mostly good, one comment in the commit message" [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [07:45:23] (03Merged) 10jenkins-bot: doc: expand logging documentation [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136984 (owner: 10Volans) [07:46:26] (03CR) 10Jelto: [C:03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [07:46:35] volans@cumin2002 downtime (PID 1232635) is awaiting input [07:48:12] (03PS4) 10Fabfur: cache: use fqdn in syslog hostname [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) [07:49:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [07:50:42] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) (owner: 10Fabfur) [07:52:04] (03PS19) 10Elukey: services: enable ingress for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 (https://phabricator.wikimedia.org/T391457) [07:52:13] 06SRE-OnFire, 10Cassandra, 06MediaWiki-Platform-Team, 10Sustainability (Incident Followup): sessionstorage namespacing - https://phabricator.wikimedia.org/T392170#10751109 (10Tgr) Pre-Cassandra, local sessions had one hour expiry, and central sessions had 24 hour expiry. Since Kask has per-namespace expiry... [07:53:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:56:37] (03CR) 10Vgutierrez: [C:04-1] cache: use fqdn in syslog hostname (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) (owner: 10Fabfur) [07:59:38] (03PS2) 10Elukey: modules: comment out gatewayHosts->domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135402 (https://phabricator.wikimedia.org/T391457) [08:00:15] (03CR) 10Elukey: "It does yes, I also tested it in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1133389 and the CI's diff is what I expec" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135402 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [08:01:28] (03CR) 10Elukey: [C:03+2] profile::prometheus::k8s: drop two more labels in Istio metrics [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [08:03:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service 
crashloop on cirrussearch2079:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [08:05:53] (03PS1) 10Elukey: role::ml_k8s::master: move ml-serve-ctrl1002 to Bookworm and containerd [puppet] - 10https://gerrit.wikimedia.org/r/1137210 (https://phabricator.wikimedia.org/T387854) [08:06:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2079-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [08:09:00] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5312/" [puppet] - 10https://gerrit.wikimedia.org/r/1137210 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [08:09:45] (03CR) 10Elukey: [V:03+1 C:03+2] role::ml_k8s::master: move ml-serve-ctrl1002 to Bookworm and containerd [puppet] - 10https://gerrit.wikimedia.org/r/1137210 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [08:12:48] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve-ctrl1002.eqiad.wmnet with OS bookworm [08:14:13] (03CR) 10Jelto: "I'm not sure why access to `deployment` group is needed. You should be able to run any deployment/scap commands with your SRE privileges. 
" [puppet] - 10https://gerrit.wikimedia.org/r/1137102 (https://phabricator.wikimedia.org/T392186) (owner: 10Arnaudb) [08:15:41] PROBLEM - Host ms-be1091 is DOWN: PING CRITICAL - Packet loss = 100% [08:16:01] this is me --^ [08:16:09] RECOVERY - Host ms-be1091 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [08:17:21] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:17:53] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:19:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:20:26] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1091.eqiad.wmnet with OS bullseye [08:22:05] PROBLEM - Host ms-be1091 is DOWN: PING CRITICAL - Packet loss = 100% [08:24:09] RECOVERY - Host ms-be1091 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [08:24:49] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:27:56] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve-ctrl1002.eqiad.wmnet with reason: host reimage [08:31:46] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve-ctrl1002.eqiad.wmnet with reason: host reimage 
[08:34:28] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be1091.eqiad.wmnet with OS bullseye [08:34:44] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1091.eqiad.wmnet with OS bullseye [08:34:50] (03CR) 10Hashar: "This adds you to the `deployment` group which is unrelated to Gerrit?" [puppet] - 10https://gerrit.wikimedia.org/r/1137102 (https://phabricator.wikimedia.org/T392186) (owner: 10Arnaudb) [08:37:23] (03PS5) 10Fabfur: cache: use fqdn in haproxykafka hostname [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) [08:39:35] (03CR) 10CI reject: [V:04-1] cache: use fqdn in haproxykafka hostname [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) (owner: 10Fabfur) [08:40:19] (03CR) 10Elukey: [C:03+2] alertmanager: update irc template for pyrra slo alerts [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925) (owner: 10Herron) [08:44:42] (03Abandoned) 10Arnaudb: ldap: gerrit admin rights [puppet] - 10https://gerrit.wikimedia.org/r/1137102 (https://phabricator.wikimedia.org/T392186) (owner: 10Arnaudb) [08:45:21] 06SRE, 06collaboration-services, 10Gerrit, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Gerrit admin to arnaudb - https://phabricator.wikimedia.org/T392186#10751198 (10hashar) > it is useful to watch replication +1 For `gerrit show-queue`, that requires the {nav View Queue} permission,, but I thin... 
[08:49:07] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1091.eqiad.wmnet with reason: host reimage [08:50:18] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve-ctrl1002.eqiad.wmnet with OS bookworm [08:52:34] (03PS6) 10Fabfur: cache: use fqdn in haproxykafka hostname [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) [08:52:46] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1091.eqiad.wmnet with reason: host reimage [08:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:55:50] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) (owner: 10Fabfur) [08:56:25] (03PS1) 10Jelto: mailman: add MailmanBounceQueueHigh alert [alerts] - 10https://gerrit.wikimedia.org/r/1137212 (https://phabricator.wikimedia.org/T391330) [08:58:23] (03CR) 10Fabfur: cache: use fqdn in haproxykafka hostname (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) (owner: 10Fabfur) [08:58:24] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1091.eqiad.wmnet with OS bullseye [08:58:39] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:13:17] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@1e9e1f9]: 
bump image suggestions to 1.5.0 [09:14:50] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@1e9e1f9]: bump image suggestions to 1.5.0 (duration: 01m 54s) [09:20:57] (03CR) 10Volans: [C:03+2] ServiceOps cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136839 (owner: 10Volans) [09:27:42] (03PS1) 10Majavah: prometheus: cloudvirt-libvirt-stats: Ignore file paths as well [puppet] - 10https://gerrit.wikimedia.org/r/1137218 (https://phabricator.wikimedia.org/T289563) [09:28:02] (03Merged) 10jenkins-bot: ServiceOps cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136839 (owner: 10Volans) [09:28:24] (03CR) 10CI reject: [V:04-1] prometheus: cloudvirt-libvirt-stats: Ignore file paths as well [puppet] - 10https://gerrit.wikimedia.org/r/1137218 (https://phabricator.wikimedia.org/T289563) (owner: 10Majavah) [09:29:11] (03PS2) 10Majavah: prometheus: cloudvirt-libvirt-stats: Ignore file paths as well [puppet] - 10https://gerrit.wikimedia.org/r/1137218 (https://phabricator.wikimedia.org/T289563) [09:43:00] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:43:56] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:45:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:49:00] (03PS1) 10Majavah: P:wmcs: toolsdb: Stop wmf-mariadb106 from auto-upgrading [puppet] - 10https://gerrit.wikimedia.org/r/1137224 (https://phabricator.wikimedia.org/T385885) [09:49:53] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5313/co" [puppet] - 10https://gerrit.wikimedia.org/r/1137224 (https://phabricator.wikimedia.org/T385885) (owner: 10Majavah) [09:56:25] (03CR) 10Filippo Giunchedi: [C:03+1] mailman: add MailmanBounceQueueHigh alert [alerts] - 10https://gerrit.wikimedia.org/r/1137212 (https://phabricator.wikimedia.org/T391330) (owner: 10Jelto) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250417T1000) [10:03:09] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136770 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [10:03:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [10:04:43] (03CR) 10Arnaudb: [C:03+1] "Thanks for the new alert! 
that will be handy to keep up with this issue" [alerts] - 10https://gerrit.wikimedia.org/r/1137212 (https://phabricator.wikimedia.org/T391330) (owner: 10Jelto) [10:04:49] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10751359 (10phaultfinder) [10:08:36] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10751363 (10elukey) Tried to reimage but indeed it fails for the swift facts being inconsistent. We'll need to fix them :( I sorted ou... [10:09:50] (03CR) 10FNegri: "Thanks, this looks good! I have a test docker container I'm using for testing, let me verify if this works there before we merge this." [puppet] - 10https://gerrit.wikimedia.org/r/1137224 (https://phabricator.wikimedia.org/T385885) (owner: 10Majavah) [10:16:56] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10751379 (10MatthewVernon) I think `swift_facts` is broadly correct, the problem is in `configure_disks.pp`: ` $facts['swift_disks... [10:18:48] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10751386 (10MatthewVernon) We solved similar elsewhere in that file (where SM and Dell were a bit different) with a regex, but I'm not... 
[10:29:38] (03CR) 10Jelto: [C:03+2] mailman: add MailmanBounceQueueHigh alert [alerts] - 10https://gerrit.wikimedia.org/r/1137212 (https://phabricator.wikimedia.org/T391330) (owner: 10Jelto) [10:30:49] (03PS1) 10Elukey: profile::swift::storage: allow non-scsi id matches for object partitions [puppet] - 10https://gerrit.wikimedia.org/r/1137243 (https://phabricator.wikimedia.org/T391854) [10:30:53] (03Merged) 10jenkins-bot: mailman: add MailmanBounceQueueHigh alert [alerts] - 10https://gerrit.wikimedia.org/r/1137212 (https://phabricator.wikimedia.org/T391330) (owner: 10Jelto) [10:32:16] (03PS2) 10Elukey: profile::swift::storage: allow non-scsi id matches for object partitions [puppet] - 10https://gerrit.wikimedia.org/r/1137243 (https://phabricator.wikimedia.org/T391854) [10:34:45] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10751453 (10phaultfinder) [10:35:25] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:36:50] (03CR) 10Elukey: [C:04-1] "Doesn't work yet" [puppet] - 10https://gerrit.wikimedia.org/r/1137243 (https://phabricator.wikimedia.org/T391854) (owner: 10Elukey) [10:38:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10751458 (10elukey) I started with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137243 but it doesn't work due to the fact that we have two "exp... [10:38:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10751459 (10MatthewVernon) Thank you so much for looking at this! 
Happy to do review once you've something working [10:43:23] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10751473 (10MatthewVernon) >>! In T391854#10751458, @elukey wrote: > I started with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137243 but it d... [10:44:10] (03PS5) 10Clément Goubert: updatequerypages: Move to sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1137228 (https://phabricator.wikimedia.org/T341555) [10:52:21] (03PS1) 10Hashar: gerrit: do not replicate apps/ [puppet] - 10https://gerrit.wikimedia.org/r/1137256 (https://phabricator.wikimedia.org/T392198) [10:57:26] (03PS4) 10Fabfur: cache: allow logging of x-cache-status also for silent-dropped reqs [puppet] - 10https://gerrit.wikimedia.org/r/1136761 (https://phabricator.wikimedia.org/T391967) [10:58:33] (03PS5) 10Fabfur: cache: allow logging of x-cache-status also for silent-dropped reqs [puppet] - 10https://gerrit.wikimedia.org/r/1136761 (https://phabricator.wikimedia.org/T391967) [10:59:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:00:00] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136761 (https://phabricator.wikimedia.org/T391967) (owner: 10Fabfur) [11:04:54] (03PS6) 10Clément Goubert: updatequerypages: Move deadendpages to sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1137228 (https://phabricator.wikimedia.org/T341555) [11:05:35] (03CR) 10CI reject: [V:04-1] updatequerypages: Move deadendpages to sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1137228 (https://phabricator.wikimedia.org/T341555) (owner: 
10Clément Goubert) [11:09:05] (03PS7) 10Clément Goubert: updatequerypages: Move deadendpages to sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1137228 (https://phabricator.wikimedia.org/T341555) [11:10:03] (03PS1) 10Ladsgroup: maintain-views: Drop views on ipblocks* [puppet] - 10https://gerrit.wikimedia.org/r/1137262 (https://phabricator.wikimedia.org/T390767) [11:12:54] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137228 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [11:22:33] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10751568 (10Ladsgroup) ms1065 and ms1066 are at 93-94% and growing. They probably gonna alert during Easter which would be suboptimal.... [11:28:26] (03PS8) 10Clément Goubert: updatequerypages: Move deadendpages to sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1137228 (https://phabricator.wikimedia.org/T341555) [11:29:25] (03PS9) 10Clément Goubert: updatequerypages: Move deadendpages to sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1137228 (https://phabricator.wikimedia.org/T341555) [11:29:49] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137228 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [11:34:07] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1066.eqiad.wmnet with reason: vacuum overlarge container dbs [11:34:18] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10751588 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8f4b0824-0524-4e18-bc89-2288516b7a58) set by ladsgroup@cum... 
[11:35:27] (03CR) 10Arnaudb: [C:03+1] "good catch @hashar@free.fr!" [puppet] - 10https://gerrit.wikimedia.org/r/1137256 (https://phabricator.wikimedia.org/T392198) (owner: 10Hashar) [11:36:25] (03PS10) 10Clément Goubert: updatequerypages: Move deadendpages to sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1137228 (https://phabricator.wikimedia.org/T341555) [11:40:39] (03CR) 10Ladsgroup: [C:03+1] "We can probably start using it in many configs now. Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) (owner: 10BryanDavis) [11:41:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1150.eqiad.wmnet with reason: Maintenance [11:42:35] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137228 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [11:44:46] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10751616 (10phaultfinder) [11:45:09] (03PS11) 10Clément Goubert: updatequerypages: Move deadendpages to sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1137228 (https://phabricator.wikimedia.org/T341555) [11:45:15] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137228 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [11:45:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1160.eqiad.wmnet with reason: Maintenance [11:45:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1160 (T391056)', diff saved to https://phabricator.wikimedia.org/P75178 and previous config saved to /var/cache/conftool/dbconfig/20250417-114551-fceratto.json [11:45:55] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 
[11:49:04] (03PS1) 10Ladsgroup: Avoid using wikitech dblist in configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 [11:49:18] 06SRE, 06collaboration-services, 10Gerrit, 10LDAP-Access-Requests: Grant Gerrit admin to arnaudb - https://phabricator.wikimedia.org/T392186#10751624 (10MatthewVernon) @LSobanski can you approve @ABran-WMF's addition to the `gerritadmin` LDAP group, please? LDAP access changes like this need manager approval. [11:49:52] (03CR) 10CI reject: [V:04-1] Avoid using wikitech dblist in configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [11:52:10] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10751627 (10MatthewVernon) Thanks. If you need any help, do shout! [11:52:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T391056)', diff saved to https://phabricator.wikimedia.org/P75179 and previous config saved to /var/cache/conftool/dbconfig/20250417-115221-fceratto.json [11:52:25] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [11:53:01] (03PS2) 10Ladsgroup: Avoid using wikitech dblist in configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 [11:53:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:53:48] (03CR) 10CI reject: [V:04-1] Avoid using wikitech dblist in configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [11:54:44] (03CR) 10Kamila Součková: [C:03+1] PageTriage: migrate updatePageTriageQueue-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1136038 (https://phabricator.wikimedia.org/T388536) 
(owner: 10Scott French) [11:56:07] (03CR) 10Kamila Součková: [C:03+1] helmfile_namespaces.yaml: Replace deprecated .Environment.Values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127085 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [11:57:58] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1066.eqiad.wmnet with reason: vacuum overlarge container dbs [11:58:09] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10751634 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f13ae1c1-83c8-4385-a3df-20f230d1fd31) set by ladsgroup@cum... [11:59:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250417T1200) [12:02:25] 06SRE, 06collaboration-services, 10Gerrit, 10LDAP-Access-Requests: Grant Gerrit admin to arnaudb - https://phabricator.wikimedia.org/T392186#10751641 (10LSobanski) Approved. 
[12:03:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2079:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:06:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2079-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:07:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P75180 and previous config saved to /var/cache/conftool/dbconfig/20250417-120728-fceratto.json [12:14:33] (03PS1) 10Gehel: refactor(opensearch): use Netbox to get rack / row information [puppet] - 10https://gerrit.wikimedia.org/r/1137272 (https://phabricator.wikimedia.org/T391392) [12:14:49] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10751659 (10VRiley-WMF) These have been added into netbox with their information [12:15:29] (03CR) 10Gehel: "check-experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137272 (https://phabricator.wikimedia.org/T391392) (owner: 10Gehel) [12:15:32] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137272 (https://phabricator.wikimedia.org/T391392) (owner: 10Gehel) [12:16:48] (03PS3) 10Ladsgroup: Avoid using wikitech dblist in configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 [12:17:35] (03CR) 10CI reject: [V:04-1] Avoid using wikitech dblist in configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [12:19:13] (03PS4) 
10Ladsgroup: Avoid using wikitech dblist in configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 [12:20:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10751686 (10VRiley-WMF) After looking at this ticket, is it safe to decomm lvs1016 and move it to the the new location? If I'm understanding this correctly, you would like it to be locate... [12:22:34] FYI, I’ll be somewhat late to the afternoon backport window [12:22:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P75181 and previous config saved to /var/cache/conftool/dbconfig/20250417-122235-fceratto.json [12:22:39] (though so far there are no patches for it anyways ^^) [12:24:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.remove-downtime for ms-be1066.eqiad.wmnet [12:24:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be1066.eqiad.wmnet [12:24:40] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10751692 (10phaultfinder) [12:25:55] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1065.eqiad.wmnet with reason: vacuum overlarge container dbs [12:26:00] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10751695 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6167d42b-b0fe-4447-a901-ec7f6cf153c8) set by ladsgroup@cum... 
[12:29:48] (03PS2) 10Gehel: refactor(opensearch): use Netbox to get rack / row information [puppet] - 10https://gerrit.wikimedia.org/r/1137272 (https://phabricator.wikimedia.org/T391392) [12:30:09] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1180.eqiad.wmnet with OS bullseye [12:30:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10751696 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1180.eqiad.wmnet with OS b... [12:31:01] (03CR) 10Majavah: [C:03+1] "somehow I thought we did this already. few things inline but nothing big" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [12:33:58] 06SRE, 06collaboration-services, 10Gerrit, 10LDAP-Access-Requests: Grant Gerrit admin to arnaudb - https://phabricator.wikimedia.org/T392186#10751708 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Done! 
[12:36:37] (03PS8) 10Clément Goubert: sharded_periodic_jobs: Kubernetes CronJob compat [puppet] - 10https://gerrit.wikimedia.org/r/1137227 (https://phabricator.wikimedia.org/T341555) [12:36:41] (03PS13) 10Clément Goubert: updatequerypages: Move deadendpages to sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1137228 (https://phabricator.wikimedia.org/T341555) [12:36:52] (03PS9) 10Clément Goubert: updatequerypages: Move deadendpages-s3 to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1137261 (https://phabricator.wikimedia.org/T341555) [12:37:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T391056)', diff saved to https://phabricator.wikimedia.org/P75182 and previous config saved to /var/cache/conftool/dbconfig/20250417-123742-fceratto.json [12:37:46] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [12:37:57] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1190.eqiad.wmnet with reason: Maintenance [12:38:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T391056)', diff saved to https://phabricator.wikimedia.org/P75183 and previous config saved to /var/cache/conftool/dbconfig/20250417-123804-fceratto.json [12:39:39] jouncebot: nowandnext [12:39:39] For the next 0 hour(s) and 20 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250417T1200) [12:39:39] In 0 hour(s) and 20 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250417T1300) [12:39:56] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: set upgradeMode to savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136716 (https://phabricator.wikimedia.org/T390853) (owner: 10DCausse) [12:41:26] (03Merged) 10jenkins-bot: cirrus-streaming-updater: set upgradeMode to savepoint [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1136716 (https://phabricator.wikimedia.org/T390853) (owner: 10DCausse) [12:42:13] (03CR) 10Clément Goubert: "Only actual diff to system resources are for `s11` because of the force `absent` in `sharded_periodic_job`." [puppet] - 10https://gerrit.wikimedia.org/r/1137228 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:42:36] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:43:12] (03CR) 10Clément Goubert: "I will re-run PCC for this one once the preceding patch has been merged, it will make the diff cleaner." [puppet] - 10https://gerrit.wikimedia.org/r/1137261 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:43:38] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:44:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T391056)', diff saved to https://phabricator.wikimedia.org/P75184 and previous config saved to /var/cache/conftool/dbconfig/20250417-124421-fceratto.json [12:44:25] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [12:44:32] 10SRE-tools, 10homer, 06Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#10751741 (10cmooney) @Volans thanks for this! I got to test it out in earnest merging a patch to add a user so I needed to roll out to absolutely everything. Worked really well... 
[12:45:34] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1180.eqiad.wmnet with reason: host reimage [12:46:02] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:46:30] (03CR) 10Andrew Bogott: [C:03+2] Add wmcs-bastionless utility script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1118526 (https://phabricator.wikimedia.org/T379550) (owner: 10Andrew Bogott) [12:46:37] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:49:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1180.eqiad.wmnet with reason: host reimage [12:49:06] (03CR) 10Dreamy Jazz: [C:03+1] frwiki: Add abusefilter-access-protected-vars to EFM, remove it from sysops. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101182 (https://phabricator.wikimedia.org/T381722) (owner: 10LD) [12:49:15] jouncebot: nowandnext [12:49:16] For the next 0 hour(s) and 10 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250417T1200) [12:49:16] In 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250417T1300) [12:49:26] (03PS1) 10DCausse: Gracefully handle BadRevisionException [extensions/CirrusSearch] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1137280 (https://phabricator.wikimedia.org/T382904) [12:49:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/CirrusSearch] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1137280 (https://phabricator.wikimedia.org/T382904) (owner: 10DCausse) [12:49:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 17 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101182 (https://phabricator.wikimedia.org/T381722) (owner: 10LD) [12:50:49] (03CR) 10Kamila Součková: [C:03+1] updatequerypages: Move deadendpages to sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1137228 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:50:59] (03PS1) 10DCausse: Gracefully handle BadRevisionException [extensions/CirrusSearch] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1137281 (https://phabricator.wikimedia.org/T382904) [12:51:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/CirrusSearch] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1137281 (https://phabricator.wikimedia.org/T382904) (owner: 10DCausse) [12:51:31] (03CR) 10Hashar: "> > That was deemed a problem in T387781" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: 10Hashar) [12:52:44] (03CR) 10Dreamy Jazz: [C:03+1] frwiki: Add abusefilter-access-protected-vars to EFM, remove it from sysops. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101182 (https://phabricator.wikimedia.org/T381722) (owner: 10LD) [12:53:29] (03CR) 10Dreamy Jazz: [C:03+1] "To clarify, the blockers for this have been worked out by de-coupling protected variable access from seeing temporary account IP addresses" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101182 (https://phabricator.wikimedia.org/T381722) (owner: 10LD) [12:53:34] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.remove-downtime for ms-be1065.eqiad.wmnet [12:53:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be1065.eqiad.wmnet [12:53:43] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:54:19] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:54:54] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10751768 (10Ladsgroup) They are much healthier now. After holidays, I'll do another round of top offenders, by that time a lot should h... [12:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:55:59] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:56:55] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:57:11] (03CR) 10Kamila Součková: [C:03+1] sharded_periodic_jobs: Kubernetes CronJob compat [puppet] - 10https://gerrit.wikimedia.org/r/1137227 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:58:39] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:59:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P75185 and previous config saved to /var/cache/conftool/dbconfig/20250417-125928-fceratto.json [12:59:52] (03CR) 10Clément Goubert: [C:03+2] sharded_periodic_jobs: Kubernetes CronJob compat [puppet] - 10https://gerrit.wikimedia.org/r/1137227 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250417T1300). [13:00:05] dcausse and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] \o [13:00:11] o/ [13:00:32] I can self-deploy my changes [13:00:52] Also happy to wait, dcausse, if you want to go first. [13:01:01] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:01:02] same, please go ahead Dreamy_Jazz, mine have to go through CI [13:01:08] Sure. 
[13:02:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101182 (https://phabricator.wikimedia.org/T381722) (owner: 10LD) [13:03:06] (03Merged) 10jenkins-bot: frwiki: Add abusefilter-access-protected-vars to EFM, remove it from sysops. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101182 (https://phabricator.wikimedia.org/T381722) (owner: 10LD) [13:03:14] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10751805 (10Ladsgroup) [13:03:40] (03PS5) 10Jforrester: Avoid using wikitech dblist in configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [13:03:40] (03CR) 10Jforrester: Avoid using wikitech dblist in configs (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [13:03:51] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1101182|frwiki: Add abusefilter-access-protected-vars to EFM, remove it from sysops. (T381722)]] [13:03:54] T381722: Add abusefilter-access-protected-vars to frwiki EFM and remove it for sysops - https://phabricator.wikimedia.org/T381722 [13:04:24] (03CR) 10CI reject: [V:04-1] Avoid using wikitech dblist in configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [13:06:19] (03CR) 10Jforrester: "Other than the newly-added exemption for in testNoUnusedDblistsLoaded, the last remaining references to ('|")wikitech('|") are in wmfImpor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [13:07:01] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:07:11] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53801 bytes in 2.465 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:07:33] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:08:16] (03CR) 10Majavah: "`wmfImportSources` is interwiki prefixes IIRC, so that can remain as is." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [13:08:44] (03PS6) 10Jforrester: Avoid using wikitech dblist in configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [13:09:26] (03CR) 10CI reject: [V:04-1] Avoid using wikitech dblist in configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [13:09:32] !log dreamyjazz@deploy1003 dreamyjazz, wpld: Backport for [[gerrit:1101182|frwiki: Add abusefilter-access-protected-vars to EFM, remove it from sysops. (T381722)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:09:36] T381722: Add abusefilter-access-protected-vars to frwiki EFM and remove it for sysops - https://phabricator.wikimedia.org/T381722 [13:09:56] testing.... [13:10:28] !log dreamyjazz@deploy1003 dreamyjazz, wpld: Continuing with sync [13:11:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1180.eqiad.wmnet with OS bullseye [13:11:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10751843 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1180.eqiad.wmnet with OS bulls... 
[13:14:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P75186 and previous config saved to /var/cache/conftool/dbconfig/20250417-131435-fceratto.json [13:14:41] (03PS1) 10Robertsky: wikimaniawiki: add extendedconfirmed to translationadmin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137283 (https://phabricator.wikimedia.org/T389729) [13:16:02] (03CR) 10Clément Goubert: [C:03+2] updatequerypages: Move deadendpages to sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1137228 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [13:17:09] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1101182|frwiki: Add abusefilter-access-protected-vars to EFM, remove it from sysops. (T381722)]] (duration: 13m 18s) [13:17:13] T381722: Add abusefilter-access-protected-vars to frwiki EFM and remove it for sysops - https://phabricator.wikimedia.org/T381722 [13:18:37] o/ [13:19:03] dcausse: are you self-servicing next? [13:19:08] o/ [13:19:11] Lucas_WMDE: yes [13:19:14] ok :) [13:22:13] Dreamy_Jazz: just saw sync finished, can I go ahead? [13:22:17] (03CR) 10Dreamy Jazz: [C:03+1] wikimaniawiki: add extendedconfirmed to translationadmin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137283 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [13:22:22] Yes. [13:22:28] ack, thanks [13:22:41] Sorry if you were waiting for an explicit confirmation. [13:23:12] np :) [13:23:15] I think Robertsky would also like to have https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1137283 merged during this window. I've +1'd the change. 
[13:23:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137283 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [13:23:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [extensions/CirrusSearch] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1137280 (https://phabricator.wikimedia.org/T382904) (owner: 10DCausse) [13:23:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [extensions/CirrusSearch] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1137281 (https://phabricator.wikimedia.org/T382904) (owner: 10DCausse) [13:24:11] hihi. :) apologies for the last minute addition... [13:25:07] Lucas_WMDE: Were you wanting to deploy anything? If not, I could deploy robertsky's change after this deploy. [13:25:10] (03Merged) 10jenkins-bot: Gracefully handle BadRevisionException [extensions/CirrusSearch] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1137280 (https://phabricator.wikimedia.org/T382904) (owner: 10DCausse) [13:25:13] (03Merged) 10jenkins-bot: Gracefully handle BadRevisionException [extensions/CirrusSearch] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1137281 (https://phabricator.wikimedia.org/T382904) (owner: 10DCausse) [13:25:15] happy to ship https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1137283 right after unless someone wants to [13:25:24] Dreamy_Jazz: not particularly, feel free to go ahead [13:25:29] or dcausse ^^ [13:25:35] either way :) [13:25:35] If you wanted to deploy dcausse that would be great. 
[13:25:38] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1137280|Gracefully handle BadRevisionException (T382904)]], [[gerrit:1137281|Gracefully handle BadRevisionException (T382904)]] [13:25:42] T382904: MediaWiki\Revision\BadRevisionException: The content of this revision is missing or corrupted (bad schema) - https://phabricator.wikimedia.org/T382904 [13:26:02] I probably should head to prepare for a meeting. [13:26:09] Dreamy_Jazz: sure, I'll ship it in a minute [13:26:10] so it would be good to go focus on that [13:26:13] Thanks! [13:26:20] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137261 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [13:26:39] Thanks! [13:28:03] (03CR) 10Hashar: [C:04-1] gerrit: avoid hardcoded hostnames, replace with hiera lookups (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [13:28:45] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage cirrussearch hosts - bking@cumin2002 - T388610 [13:28:49] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [13:28:58] (03CR) 10Kamila Součková: updatequerypages: Move deadendpages-s3 to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137261 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [13:29:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T391056)', diff saved to https://phabricator.wikimedia.org/P75187 and previous config saved to /var/cache/conftool/dbconfig/20250417-132942-fceratto.json [13:29:46] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [13:29:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on db1199.eqiad.wmnet with reason: Maintenance [13:30:05] (03CR) 10Arnaudb: [C:03+2] gerrit: do not replicate apps/ [puppet] - 10https://gerrit.wikimedia.org/r/1137256 (https://phabricator.wikimedia.org/T392198) (owner: 10Hashar) [13:30:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T391056)', diff saved to https://phabricator.wikimedia.org/P75188 and previous config saved to /var/cache/conftool/dbconfig/20250417-133004-fceratto.json [13:30:46] !log dcausse@deploy1003 dcausse: Backport for [[gerrit:1137280|Gracefully handle BadRevisionException (T382904)]], [[gerrit:1137281|Gracefully handle BadRevisionException (T382904)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:30:50] T382904: MediaWiki\Revision\BadRevisionException: The content of this revision is missing or corrupted (bad schema) - https://phabricator.wikimedia.org/T382904 [13:30:52] testing [13:31:31] !log dcausse@deploy1003 dcausse: Continuing with sync [13:32:19] (03CR) 10Clément Goubert: updatequerypages: Move deadendpages-s3 to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137261 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [13:33:50] (03PS1) 10Volans: tests: refactor global tests for all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1137284 [13:33:50] (03PS1) 10Volans: sre.mysql.clone: fix warnings/tests [cookbooks] - 10https://gerrit.wikimedia.org/r/1137285 [13:35:34] (03CR) 10FNegri: [C:03+1] "Tested, this seems to work fine!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1137224 (https://phabricator.wikimedia.org/T385885) (owner: 10Majavah) [13:36:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T391056)', diff saved to https://phabricator.wikimedia.org/P75189 and previous config saved to /var/cache/conftool/dbconfig/20250417-133618-fceratto.json [13:36:23] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [13:38:02] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137280|Gracefully handle BadRevisionException (T382904)]], [[gerrit:1137281|Gracefully handle BadRevisionException (T382904)]] (duration: 12m 23s) [13:38:06] T382904: MediaWiki\Revision\BadRevisionException: The content of this revision is missing or corrupted (bad schema) - https://phabricator.wikimedia.org/T382904 [13:38:15] (03PS10) 10Clément Goubert: updatequerypages: Move deadendpages-s3 to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1137261 (https://phabricator.wikimedia.org/T341555) [13:38:21] robertsky: going to ship your config change [13:38:24] (03CR) 10Gergő Tisza: "I wonder if we should move these documentation-only dblists (as opposed to the ones in MWMultiVersion::DB_LISTS that actually work in conf" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) (owner: 10BryanDavis) [13:38:49] ok [13:39:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137283 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [13:39:31] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137261 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [13:40:40] (03Merged) 10jenkins-bot: wikimaniawiki: add extendedconfirmed to translationadmin [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1137283 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [13:41:05] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1137283|wikimaniawiki: add extendedconfirmed to translationadmin (T389729)]] [13:41:09] T389729: wikimaniawiki: namespaces for 2027-2028 and other adjustments - https://phabricator.wikimedia.org/T389729 [13:44:41] have verified the changes on debug. : [13:44:44] :) [13:45:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:46:06] !log dcausse@deploy1003 dcausse, robertsky: Backport for [[gerrit:1137283|wikimaniawiki: add extendedconfirmed to translationadmin (T389729)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:46:09] T389729: wikimaniawiki: namespaces for 2027-2028 and other adjustments - https://phabricator.wikimedia.org/T389729 [13:46:13] (03PS11) 10Clément Goubert: updatequerypages: Move deadendpages-s3 to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1137261 (https://phabricator.wikimedia.org/T341555) [13:47:00] robertsky: ok, shipping :) [13:47:35] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137261 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [13:47:57] !log dcausse@deploy1003 dcausse, robertsky: Continuing with sync [13:48:42] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage cirrussearch hosts - bking@cumin2002 - T388610 [13:48:45] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [13:49:52] 
!log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage cirrussearch hosts - bking@cumin2002 - T388610 [13:51:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P75190 and previous config saved to /var/cache/conftool/dbconfig/20250417-135125-fceratto.json [13:54:31] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137283|wikimaniawiki: add extendedconfirmed to translationadmin (T389729)]] (duration: 13m 25s) [13:54:34] T389729: wikimaniawiki: namespaces for 2027-2028 and other adjustments - https://phabricator.wikimedia.org/T389729 [13:54:41] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10752072 (10phaultfinder) [13:55:11] robertsky: should be live :) [13:55:50] yup. thanks. verified. :) [13:57:00] !log closing the UTC afternoon backport window [13:57:02] yw! [13:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:44] (03PS1) 10Effie Mouzeli: switch mwdebug2001 to php8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1137290 (https://phabricator.wikimedia.org/T391452) [14:01:08] (03CR) 10CI reject: [V:04-1] switch mwdebug2001 to php8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1137290 (https://phabricator.wikimedia.org/T391452) (owner: 10Effie Mouzeli) [14:01:43] !log jiji@cumin1002 conftool action : set/pooled=inactive; selector: name=mwdebug2002.codfw.wmnet [14:02:03] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:02:21] !log jiji@cumin1002 conftool action : set/pooled=yes; selector: name=mwdebug2002.codfw.wmnet [14:02:28] !log jiji@cumin1002 conftool action : set/pooled=inactive; selector: name=mwdebug1002.codfw.wmnet [14:02:55] !log jiji@cumin1002 conftool action : set/pooled=inactive; selector: name=mwdebug2001.codfw.wmnet [14:02:59] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:03:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [14:04:16] (03PS2) 10Effie Mouzeli: switch mwdebug2001 to php8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1137290 (https://phabricator.wikimedia.org/T391452) [14:04:30] jouncebot: now [14:04:30] No deployments scheduled for the next 1 hour(s) and 55 minute(s) [14:04:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:06:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P75191 and previous config saved to /var/cache/conftool/dbconfig/20250417-140632-fceratto.json [14:09:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:10:19] (03CR) 10Effie Mouzeli: [C:03+2] "+2ing as it is similar to I4d52305c3bdfaebb5f9d748f38b018b5b6f09f4f" [puppet] - 10https://gerrit.wikimedia.org/r/1137290 (https://phabricator.wikimedia.org/T391452) (owner: 10Effie Mouzeli) [14:10:20] effie: I am gonna restart Gerrit [14:10:30] I was waiting for the window to complete [14:10:39] it takes a minute or so :) [14:10:44] oh [14:11:10] hashar: merged, please ping me when it is back [14:11:42] !log Restarting Gerrit to apply replication configuration change [14:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:02] was about to complain about gerrit being down but that explains :-) [14:14:25] of course the day it is supposed to be fast, it takes a while :/ [14:14:31] FIRING: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:14:43] Apr 17 14:14:33 gerrit2002 systemd[1]: gerrit.service: State 'stop-sigterm' timed out. Killing. [14:14:47] it got SIGKILL by systemd [14:14:49] :/ [14:15:32] effie: Gerrit is back [14:15:35] thank you! [14:15:43] cheers! 
[14:15:54] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs: toolsdb: Stop wmf-mariadb106 from auto-upgrading [puppet] - 10https://gerrit.wikimedia.org/r/1137224 (https://phabricator.wikimedia.org/T385885) (owner: 10Majavah) [14:17:25] FIRING: [2x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:19:21] (03CR) 10Kamila Součková: [C:03+1] "LGTM except for commit message" [puppet] - 10https://gerrit.wikimedia.org/r/1137261 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [14:19:31] RESOLVED: [4x] ProbeDown: Service gerrit2002:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:21:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T391056)', diff saved to https://phabricator.wikimedia.org/P75193 and previous config saved to /var/cache/conftool/dbconfig/20250417-142139-fceratto.json [14:21:44] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [14:21:56] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1221.eqiad.wmnet with reason: Maintenance [14:22:14] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:22:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T391056)', diff saved to https://phabricator.wikimedia.org/P75194 and previous config saved to /var/cache/conftool/dbconfig/20250417-142221-fceratto.json [14:22:25] RESOLVED: [2x] SystemdUnitFailed: 
helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:23:07] (03PS12) 10Clément Goubert: updatequerypages: Move deadendpages-s3 to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1137261 (https://phabricator.wikimedia.org/T341555) [14:23:15] (03CR) 10Clément Goubert: updatequerypages: Move deadendpages-s3 to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137261 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [14:26:16] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10752207 (10fnegri) > We should comp air inlet air temps from any neighboring device before assuming that is bad though. The next two servers in t... [14:26:24] (03CR) 10Clément Goubert: [C:03+2] updatequerypages: Move deadendpages-s3 to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1137261 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [14:27:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T391056)', diff saved to https://phabricator.wikimedia.org/P75195 and previous config saved to /var/cache/conftool/dbconfig/20250417-142746-fceratto.json [14:27:51] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [14:30:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:31:31] !log cgoubert@deploy1003 helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply [14:32:03] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:32:19] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:33:01] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:33:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10752236 (10fnegri) p:05High→03Medium [14:34:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10752237 (10fnegri) 05Stalled→03In progress [14:34:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10752239 (10phaultfinder) [14:35:25] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:35:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:36:15] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mwdebug2001.codfw.wmnet with OS bullseye [14:42:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - 
https://phabricator.wikimedia.org/T391540#10752273 (10RobH) 05Open→03Stalled a:05KOfori→03None Please do not take any actions on this task as the fundraising destination rack will likely now shi... [14:42:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P75196 and previous config saved to /var/cache/conftool/dbconfig/20250417-144254-fceratto.json [14:46:42] FIRING: JobUnavailable: Reduced availability for job php in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:47:29] (03CR) 10Fabfur: cache: allow logging of x-cache-status also for silent-dropped reqs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1136761 (https://phabricator.wikimedia.org/T391967) (owner: 10Fabfur) [14:49:47] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10752290 (10elukey) @MatthewVernon: I am not sure if we have to rely on the current objectX (with X==Integer) format for the /srv/swift-storage dirs, but... 
[14:51:33] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:51:42] RESOLVED: JobUnavailable: Reduced availability for job php in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:50] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mwdebug2001.codfw.wmnet with reason: host reimage [14:54:35] (03PS7) 10Clément Goubert: updatequerypages: Move all to sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1137306 (https://phabricator.wikimedia.org/T341555) [14:54:49] (03PS6) 10Fabfur: cache: allow logging of x-cache-status also for silent-dropped reqs [puppet] - 10https://gerrit.wikimedia.org/r/1136761 (https://phabricator.wikimedia.org/T391967) [14:55:15] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137306 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [14:55:53] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage cirrussearch hosts - bking@cumin2002 - T388610 [14:55:57] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [14:56:45] (03CR) 10Clément Goubert: "Like I2314e4063ef8095166643d8f2772a484064a7010 but for all `updatequerypages` resources" [puppet] - 10https://gerrit.wikimedia.org/r/1137306 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [14:57:08] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwdebug2001.codfw.wmnet with reason: host reimage [14:57:52] (03PS8) 10Clément Goubert: updatequerypages: Move all to sharded_periodic_job [puppet] - 
10https://gerrit.wikimedia.org/r/1137306 (https://phabricator.wikimedia.org/T341555) [14:58:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P75197 and previous config saved to /var/cache/conftool/dbconfig/20250417-145801-fceratto.json [14:58:09] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:58:53] (03PS9) 10Clément Goubert: updatequerypages: Move all to sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1137306 (https://phabricator.wikimedia.org/T341555) [15:02:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): relocate (1) discovery-search elastic1067 out of eqiad D6 - https://phabricator.wikimedia.org/T391542#10752326 (10RobH) 05Open→03Stalled Please do not take any actions on this task as the fundraising destination rack will like... [15:02:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: relocate (3) service-ops hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391599#10752329 (10RobH) a:05RobH→03None Please do not take any actions on this task as the fundraising destination rack will likely now shift from D6 to row B. We'll have mor... 
[15:02:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): relocate (1) discovery-search elastic1067 out of eqiad D6 - https://phabricator.wikimedia.org/T391542#10752333 (10RobH) a:05Gehel→03None [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:57] FIRING: [2x] JobUnavailable: Reduced availability for job php in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:03] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10752388 (10RobH) a:03Jclark-ctr Summary from where I see things: * I think that B8 is the best choice as it requires the least amount of new infrastructure and bu... 
[15:07:58] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137306 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [15:13:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T391056)', diff saved to https://phabricator.wikimedia.org/P75198 and previous config saved to /var/cache/conftool/dbconfig/20250417-151308-fceratto.json [15:13:12] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:13:24] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1238.eqiad.wmnet with reason: Maintenance [15:13:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T391056)', diff saved to https://phabricator.wikimedia.org/P75199 and previous config saved to /var/cache/conftool/dbconfig/20250417-151330-fceratto.json [15:21:26] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10752470 (10MatthewVernon) So [[ https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/swift/storage/configure_disks.p... 
[15:25:12] (03PS2) 10Federico Ceratto: values.yaml: Update chart for zarcillo in aux-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137314 (https://phabricator.wikimedia.org/T384212) [15:25:12] (03CR) 10Federico Ceratto: "Small config cleanup; also addresses comments from" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137314 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [15:25:52] (03PS13) 10Filippo Giunchedi: prometheus/alerts: define alert rules directly in puppet [puppet] - 10https://gerrit.wikimedia.org/r/1101066 (https://phabricator.wikimedia.org/T381665) (owner: 10Tiziano Fogli) [15:26:42] FIRING: [2x] JobUnavailable: Reduced availability for job php in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:27:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T391056)', diff saved to https://phabricator.wikimedia.org/P75200 and previous config saved to /var/cache/conftool/dbconfig/20250417-152724-fceratto.json [15:27:29] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:27:37] (03CR) 10Clément Goubert: [C:03+1] values.yaml: Update chart for zarcillo in aux-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137314 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [15:31:42] FIRING: [2x] JobUnavailable: Reduced availability for job php in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:31:49] (03CR) 10Clément Goubert: [C:03+1] values.yaml: Update chart for zarcillo in aux-k8s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137314 
(https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [15:34:16] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mwdebug2001.codfw.wmnet with OS bullseye [15:34:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10752550 (10phaultfinder) [15:35:02] (03PS1) 10Effie Mouzeli: mediawiki rsyslog: fix logging to logstash [puppet] - 10https://gerrit.wikimedia.org/r/1137315 [15:36:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job php in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:42:18] (03CR) 10RLazarus: [C:03+2] helmfile_namespaces.yaml: Replace deprecated .Environment.Values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127085 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [15:42:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P75201 and previous config saved to /var/cache/conftool/dbconfig/20250417-154231-fceratto.json [15:47:27] (03Merged) 10jenkins-bot: helmfile_namespaces.yaml: Replace deprecated .Environment.Values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127085 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [15:53:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:58] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2066 to cirrussearch2066 [15:54:21] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:55:05] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: 
OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:55:12] (03CR) 10Scott French: [C:03+1] mediawiki rsyslog: fix logging to logstash [puppet] - 10https://gerrit.wikimedia.org/r/1137315 (owner: 10Effie Mouzeli) [15:56:01] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:56:24] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki rsyslog: fix logging to logstash [puppet] - 10https://gerrit.wikimedia.org/r/1137315 (owner: 10Effie Mouzeli) [15:57:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P75204 and previous config saved to /var/cache/conftool/dbconfig/20250417-155738-fceratto.json [15:59:58] bking@cumin2002 rename (PID 1726744) is awaiting input [16:00:04] jhathaway and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250417T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:14] (03CR) 10Volans: "As I'll be out the next few days feel free to merge it." [cookbooks] - 10https://gerrit.wikimedia.org/r/1137284 (owner: 10Volans) [16:00:20] (03CR) 10Volans: "As I'll be out the next few days feel free to merge it." [cookbooks] - 10https://gerrit.wikimedia.org/r/1137285 (owner: 10Volans) [16:00:28] (03CR) 10Volans: "As I'll be out the next few days feel free to merge it." [cookbooks] - 10https://gerrit.wikimedia.org/r/1136843 (owner: 10Volans) [16:00:32] (03CR) 10Volans: "As I'll be out the next few days feel free to merge it." [cookbooks] - 10https://gerrit.wikimedia.org/r/1136842 (owner: 10Volans) [16:00:36] (03CR) 10Volans: "As I'll be out the next few days feel free to merge it." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1136840 (owner: 10Volans) [16:00:40] (03CR) 10Volans: "As I'll be out the next few days feel free to merge it." [cookbooks] - 10https://gerrit.wikimedia.org/r/1136836 (owner: 10Volans) [16:03:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2079:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [16:06:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2079-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:07:04] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2066 to cirrussearch2066 - bking@cumin2002" [16:07:54] (03PS5) 10Volans: netbox: add fetch_device_interfaces using GraphQL [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [16:09:25] (03CR) 10Volans: "I've updated the CR with what suggested and fixed existing tests." 
[software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [16:09:26] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2066 to cirrussearch2066 - bking@cumin2002" [16:09:26] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:09:27] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2066 [16:09:38] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2066 [16:10:18] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2066 to cirrussearch2066 [16:10:37] (03PS3) 10Elukey: profile::swift::storage: allow non-scsi id matches for object partitions [puppet] - 10https://gerrit.wikimedia.org/r/1137243 (https://phabricator.wikimedia.org/T391854) [16:10:52] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2066.codfw.wmnet with OS bullseye [16:11:04] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2066 [16:11:16] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:11:55] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5316/console" [puppet] - 10https://gerrit.wikimedia.org/r/1137243 (https://phabricator.wikimedia.org/T391854) (owner: 10Elukey) [16:12:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T391056)', diff saved to https://phabricator.wikimedia.org/P75205 and previous config saved to /var/cache/conftool/dbconfig/20250417-161245-fceratto.json [16:12:49] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [16:13:01] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on db1241.eqiad.wmnet with reason: Maintenance [16:13:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T391056)', diff saved to https://phabricator.wikimedia.org/P75206 and previous config saved to /var/cache/conftool/dbconfig/20250417-161307-fceratto.json [16:14:39] (03PS4) 10Elukey: profile::swift::storage: allow non-scsi id matches for object partitions [puppet] - 10https://gerrit.wikimedia.org/r/1137243 (https://phabricator.wikimedia.org/T391854) [16:15:24] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2066 - bking@cumin2002" [16:15:30] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2066 - bking@cumin2002" [16:15:30] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:15:31] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2066.codfw.wmnet 69.32.192.10.in-addr.arpa 9.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:15:34] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2066.codfw.wmnet 69.32.192.10.in-addr.arpa 9.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:15:35] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2066 [16:16:13] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5317/console" [puppet] - 10https://gerrit.wikimedia.org/r/1137243 (https://phabricator.wikimedia.org/T391854) (owner: 10Elukey) [16:17:10] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2066 [16:17:10] !log 
bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2066 [16:17:47] (03PS5) 10Elukey: profile::swift::storage: allow non-scsi id matches for object partitions [puppet] - 10https://gerrit.wikimedia.org/r/1137243 (https://phabricator.wikimedia.org/T391854) [16:17:58] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10752778 (10RobH) ` robh@cp4047:~$ zgrep "Buffer I/O error on dev nvme0n1" /var/log/kern.log.2.gz |wc -l 44 robh@cp4047:~$ fgrep "Buffer I/O error on dev nvme0n1" /var/log/kern.log.1 |wc -l 0 robh@cp4047... [16:18:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:18:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T391056)', diff saved to https://phabricator.wikimedia.org/P75207 and previous config saved to /var/cache/conftool/dbconfig/20250417-161854-fceratto.json [16:18:58] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5318/console" [puppet] - 10https://gerrit.wikimedia.org/r/1137243 (https://phabricator.wikimedia.org/T391854) (owner: 10Elukey) [16:19:05] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [16:19:44] (03CR) 10Elukey: [V:03+1] "pcc doesn't show any diff but if you check the change catalog the mountpoint look ok." [puppet] - 10https://gerrit.wikimedia.org/r/1137243 (https://phabricator.wikimedia.org/T391854) (owner: 10Elukey) [16:19:54] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1136038 (https://phabricator.wikimedia.org/T388536) (owner: 10Scott French) [16:20:00] (03CR) 10Scott French: [C:03+2] PageTriage: migrate updatePageTriageQueue-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1136038 (https://phabricator.wikimedia.org/T388536) (owner: 10Scott French) [16:21:49] (03CR) 10Kamila Součková: [C:03+1] updatequerypages: Move all to sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1137306 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [16:21:52] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10752788 (10elukey) @MatthewVernon we could go for something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137243, test ms-be1091 and decide... [16:23:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:24:09] PROBLEM - Host restbase2035 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:43] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:25:17] (03CR) 10AOkoth: [C:03+1] miscweb: os-reports: deploy os-reports to k8s-aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [16:25:19] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:26:35] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1137323 [16:26:35] (03CR) 10TrainBranchBot: 
[C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1137323 (owner: 10TrainBranchBot) [16:28:03] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:28:39] FIRING: [6x] ProbeDown: Service restbase2035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:28:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:30:01] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [16:30:03] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:30:05] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [16:30:22] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply [16:30:31] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [16:32:09] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53800 bytes in 0.374 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:32:15] * urandom is looking at the restbase host... 
[16:32:33] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:32:48] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2066.codfw.wmnet with reason: host reimage [16:33:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:34:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P75208 and previous config saved to /var/cache/conftool/dbconfig/20250417-163403-fceratto.json [16:34:48] (03CR) 10Clément Goubert: [C:03+2] updatequerypages: Move all to sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1137306 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [16:35:49] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2066.codfw.wmnet with reason: host reimage [16:36:48] 06SRE, 10Wikimedia-Mailing-lists, 10Catalyst (olin): Create a PatchDemo/Catalyst mailing list - https://phabricator.wikimedia.org/T388922#10752868 (10Dzahn) @thcipriani You meant mailman, right? 
[16:38:08] 06SRE, 10Wikimedia-Mailing-lists, 10Catalyst (olin): Create a PatchDemo/Catalyst mailing list - https://phabricator.wikimedia.org/T388922#10752869 (10Dzahn) Please come up with a good name that doesn't conflict too much with https://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_(lists) [16:41:08] (03CR) 10Kamila Součková: helmfile_namespaces: Merge hiera services with admin_ng namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127086 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [16:41:16] (03CR) 10Kamila Součková: [C:03+1] helmfile_namespaces: Merge hiera services with admin_ng namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127086 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [16:43:20] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:43:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:48:53] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1137323 (owner: 10TrainBranchBot) [16:49:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P75209 and previous config saved to /var/cache/conftool/dbconfig/20250417-164909-fceratto.json [16:49:10] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:49:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:50:19] 10ops-codfw, 10Cassandra, 06DC-Ops: restbase2035 is down - https://phabricator.wikimedia.org/T392243#10752952 
(10Eevans) p:05Triage→03High [16:53:18] (03PS1) 10BryanDavis: developer-portal: Bump container to 2025-04-17-122309-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137325 [16:57:42] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10752986 (10Ladsgroup) [16:58:14] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10752990 (10BCornwall) 05Open→03Resolved Yeah, we're looking good! Thanks for sticking it out and doing this, @RobH! [16:58:17] (03CR) 10Kamila Součková: [C:03+1] Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127087 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [16:58:39] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:59:35] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2025-04-17-122309-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137325 (owner: 10BryanDavis) [17:00:04] bd808: gettimeofday() says it's time for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250417T1700) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250417T1700) [17:00:35] o/ I will push out a new developer-portal build today [17:01:04] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2025-04-17-122309-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137325 (owner: 10BryanDavis) [17:02:57] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2066.codfw.wmnet with OS bullseye [17:04:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T391056)', diff saved to https://phabricator.wikimedia.org/P75210 and previous config saved to /var/cache/conftool/dbconfig/20250417-170416-fceratto.json [17:04:20] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [17:04:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1242.eqiad.wmnet with reason: Maintenance [17:04:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T391056)', diff saved to https://phabricator.wikimedia.org/P75211 and previous config saved to /var/cache/conftool/dbconfig/20250417-170438-fceratto.json [17:04:53] (03CR) 10Dzahn: [C:03+2] aptrepo: add jenkins to bookworm section in distributions-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1137060 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [17:07:15] 10ops-codfw, 10Cassandra, 06DC-Ops: restbase2035 is down - https://phabricator.wikimedia.org/T392243#10753029 (10Eevans) [17:08:51] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:09:05] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:09:22] !log bd808@deploy1003 helmfile [codfw] START 
helmfile.d/services/developer-portal: apply [17:09:43] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:10:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T391056)', diff saved to https://phabricator.wikimedia.org/P75212 and previous config saved to /var/cache/conftool/dbconfig/20250417-171032-fceratto.json [17:10:36] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [17:11:36] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/f6f5517444c0e6ac6856ee72ce652871a39bd66d0e257e08749d2e18dbdeec17/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [17:11:47] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:12:06] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:13:54] re: disk space on releases1003 - meh.. it's the issue again with docker overlay fs. but dont know what changed. probably people working on it yesterday. we need some config change to exclude that or .. something [17:17:25] ACKNOWLEDGEMENT - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/f6f5517444c0e6ac6856ee72ce652871a39bd66d0e257e08749d2e18dbdeec17/merged is not accessible: Permission denied daniel_zahn https://phabricator.wikimedia.org/T392127#10753052 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [17:20:05] !log idp-test2005 - 100% disk space used - alerting since over 6 days (is there a point in alerts for test hosts?) - apt-get clean .. 
brought it back to 94% [17:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P75213 and previous config saved to /var/cache/conftool/dbconfig/20250417-172539-fceratto.json [17:28:50] (03PS1) 10Dzahn: idp-test: disable monitoring notifications, copy theme from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/1137329 [17:29:15] (03CR) 10CI reject: [V:04-1] idp-test: disable monitoring notifications, copy theme from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/1137329 (owner: 10Dzahn) [17:29:40] (03PS1) 10Jforrester: [wikifunctionswiki] Enable Parsoid in wikitext articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137330 [17:29:59] (03PS2) 10Dzahn: idp-test: disable monitoring notifications, copy theme setting [puppet] - 10https://gerrit.wikimedia.org/r/1137329 [17:30:07] (03CR) 10Jforrester: "I assume this is OK for us to enable, but wanted to check!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137330 (owner: 10Jforrester) [17:32:31] ACKNOWLEDGEMENT - Postfix SMTP on crm2001 is CRITICAL: connect to address 10.192.0.18 and port 25: Connection refused daniel_zahn https://phabricator.wikimedia.org/T383715 https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [17:40:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P75214 and previous config saved to /var/cache/conftool/dbconfig/20250417-174046-fceratto.json [17:43:39] FIRING: [2x] ProbeDown: Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:43:56] RECOVERY - Host restbase2035 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [17:45:02] (03PS1) 10Jforrester: tests: Add a Wikifunctions-related test suite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137333 [17:46:38] 06SRE, 10Wikimedia-Mailing-lists, 10Catalyst (olin): Create a PatchDemo/Catalyst mailing list - https://phabricator.wikimedia.org/T388922#10753277 (10thcipriani) Thanks for the reply, both! Yep, that was the plan, mailman mailing list. >>! In T388922#10752869, @Dzahn wrote: > Please come up with a good name... 
[17:50:00] (03CR) 10RLazarus: [C:03+2] helmfile_namespaces: Merge hiera services with admin_ng namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127086 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [17:50:09] (03CR) 10CI reject: [V:04-1] helmfile_namespaces: Merge hiera services with admin_ng namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127086 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [17:50:50] (03PS4) 10RLazarus: helmfile_namespaces: Merge hiera services with admin_ng namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127086 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [17:51:36] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [17:53:06] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:53:06] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:53:08] (03CR) 10RLazarus: helmfile_namespaces: Merge hiera services with admin_ng namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127086 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [17:53:11] (03CR) 10RLazarus: [C:03+2] helmfile_namespaces: Merge hiera services with admin_ng namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127086 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [17:53:44] FIRING: [8x] ProbeDown: Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:54:02] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:54:04] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:54:18] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch2079 is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 59, number_of_data_nodes: 59, discovered_master: True, active_primary_shards: 1355, active_shards: 4186, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 3, delayed_unassigned_shards: 0, number_of_pending [17:54:18] 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 12, active_shards_percent_as_number: 99.88069673109044 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:55:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T391056)', diff saved to https://phabricator.wikimedia.org/P75215 and previous config saved to 
/var/cache/conftool/dbconfig/20250417-175552-fceratto.json [17:55:56] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [17:56:08] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1243.eqiad.wmnet with reason: Maintenance [17:56:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T391056)', diff saved to https://phabricator.wikimedia.org/P75216 and previous config saved to /var/cache/conftool/dbconfig/20250417-175614-fceratto.json [17:58:24] RECOVERY - OpenSearch health check for shards on 9600 on cirrussearch2079 is OK: OK - elasticsearch status production-search-psi-codfw: cluster_name: production-search-psi-codfw, status: yellow, timed_out: False, number_of_nodes: 29, number_of_data_nodes: 29, discovered_master: True, active_primary_shards: 1678, active_shards: 5032, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 1, delayed_unassigned_shards: 0, number_of [17:58:24] _tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1159, active_shards_percent_as_number: 99.98013113451222 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:58:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2079:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [18:00:05] dduvall and brennen: May I have your attention please! MediaWiki train - Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250417T1800) [18:00:14] o/ [18:00:19] o/ [18:01:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2079-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [18:02:17] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137336 (https://phabricator.wikimedia.org/T386220) [18:02:19] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137336 (https://phabricator.wikimedia.org/T386220) (owner: 10TrainBranchBot) [18:03:15] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137336 (https://phabricator.wikimedia.org/T386220) (owner: 10TrainBranchBot) [18:03:27] RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2079:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [18:08:26] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10753431 (10BCornwall) Hi, @RobH. Was Dell able to investigate cp7001 since it's had the offset removed? It hovers around 80° now. [18:09:36] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10753432 (10RobH) >>! In T386959#10753431, @BCornwall wrote: > Hi, @RobH. Was Dell able to investigate cp7001 since it's had the offset removed? It hovers around 80° now. I hadn't reinvestigated this since... 
[18:09:51] (03Merged) 10jenkins-bot: helmfile_namespaces: Merge hiera services with admin_ng namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127086 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [18:13:33] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.25 refs T386220 [18:13:39] T386220: 1.44.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T386220 [18:14:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T391056)', diff saved to https://phabricator.wikimedia.org/P75217 and previous config saved to /var/cache/conftool/dbconfig/20250417-181408-fceratto.json [18:14:13] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [18:14:27] 10ops-codfw, 06SRE, 10Cassandra, 06DC-Ops: restbase2035 is down - https://phabricator.wikimedia.org/T392243#10753453 (10Eevans) a:03Jhancock.wm [18:15:09] (03PS2) 10RLazarus: Revert "Add second pair of kubeconfig files for restricted users" [puppet] - 10https://gerrit.wikimedia.org/r/1127064 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [18:20:59] brennen: looks mostly quiet. that "undefined property" error was already filed as https://phabricator.wikimedia.org/T391869 [18:21:29] dduvall: ::nod:: - looks fairly chill [18:23:26] (03PS4) 10RLazarus: helmfile: Dump data about each service (users, namespace etc.) 
to yaml [puppet] - 10https://gerrit.wikimedia.org/r/1126965 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [18:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10753480 (10phaultfinder) [18:29:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P75218 and previous config saved to /var/cache/conftool/dbconfig/20250417-182916-fceratto.json [18:29:47] whoops, I broke puppet on the deployment server -- fixing forward shortly [18:33:00] (03PS1) 10RLazarus: deployment_server: Fix a type mismatch [puppet] - 10https://gerrit.wikimedia.org/r/1137339 (https://phabricator.wikimedia.org/T378429) [18:35:25] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:37:18] (03CR) 10Bking: [C:03+1] deployment_server: Fix a type mismatch [puppet] - 10https://gerrit.wikimedia.org/r/1137339 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [18:37:54] (03CR) 10RLazarus: [C:03+2] deployment_server: Fix a type mismatch [puppet] - 10https://gerrit.wikimedia.org/r/1137339 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [18:44:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P75219 and previous config saved to /var/cache/conftool/dbconfig/20250417-184423-fceratto.json [18:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:57:17] 06SRE, 10Wikimedia-Mailing-lists, 10Catalyst (olin): Create a PatchDemo/Catalyst mailing list - https://phabricator.wikimedia.org/T388922#10753603 (10Dzahn) ` Created mailing list: patchdemo@lists.wikimedia.org ` @thcipriani I just ran the shell command to create the list. With a single --owner parameter w... [18:58:49] 06SRE, 10Wikimedia-Mailing-lists, 10Catalyst (olin): Create a PatchDemo/Catalyst mailing list - https://phabricator.wikimedia.org/T388922#10753608 (10Dzahn) Here is the form where users can subscribe to the list: https://lists.wikimedia.org/postorius/lists/patchdemo.lists.wikimedia.org/ Also shows archives... [18:59:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T391056)', diff saved to https://phabricator.wikimedia.org/P75221 and previous config saved to /var/cache/conftool/dbconfig/20250417-185930-fceratto.json [18:59:34] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [18:59:47] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1245.eqiad.wmnet with reason: Maintenance [18:59:49] 06SRE, 10Wikimedia-Mailing-lists, 10Catalyst (olin): Create a PatchDemo/Catalyst mailing list - https://phabricator.wikimedia.org/T388922#10753624 (10Dzahn) Requests for configuration changes should now go to `patchdemo-owner@lists.wikimedia.org`. 
:) [19:00:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:03:26] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1247.eqiad.wmnet with reason: Maintenance [19:03:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T391056)', diff saved to https://phabricator.wikimedia.org/P75222 and previous config saved to /var/cache/conftool/dbconfig/20250417-190331-fceratto.json [19:09:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T391056)', diff saved to https://phabricator.wikimedia.org/P75223 and previous config saved to /var/cache/conftool/dbconfig/20250417-190923-fceratto.json [19:09:28] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [19:14:06] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1178.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:17:30] (03PS1) 10Robertsky: wikimaniawiki: update logo to 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137349 (https://phabricator.wikimedia.org/T392239) [19:21:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137349 (https://phabricator.wikimedia.org/T392239) (owner: 10Robertsky) [19:21:50] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1178.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:22:33] 
!log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1178.eqiad.wmnet with OS bullseye [19:22:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10753671 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1178.eqiad.wmnet with OS b... [19:24:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P75225 and previous config saved to /var/cache/conftool/dbconfig/20250417-192430-fceratto.json [19:35:43] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1178.eqiad.wmnet with OS bullseye [19:35:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10753713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1178.eqiad.wmnet with OS bulls... [19:36:20] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1178.eqiad.wmnet with OS bullseye [19:36:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10753715 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1178.eqiad.wmnet with OS b... [19:38:19] 10ops-codfw, 06SRE, 10Cassandra, 06DC-Ops: restbase2035 is down - https://phabricator.wikimedia.org/T392243#10753716 (10Jhancock.wm) A2 was reseated A2 was still producing errors. A2 swapped with B2 errors followed to B2. replaced with in stock 32GB 3200 DIMM card started RMA process with Dell to replace t... 
[19:39:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P75226 and previous config saved to /var/cache/conftool/dbconfig/20250417-193935-fceratto.json [19:42:27] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:43:31] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:44:08] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1181.eqiad.wmnet with OS bullseye [19:44:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10753719 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1181.eqiad.wmnet with OS b... [19:50:17] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1178.eqiad.wmnet with OS bullseye [19:50:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10753737 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1178.eqiad.wmnet with OS bulls... 
[19:50:45] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1178.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:53:40] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:54:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T391056)', diff saved to https://phabricator.wikimedia.org/P75228 and previous config saved to /var/cache/conftool/dbconfig/20250417-195442-fceratto.json [19:54:47] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [19:54:59] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1248.eqiad.wmnet with reason: Maintenance [19:55:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T391056)', diff saved to https://phabricator.wikimedia.org/P75229 and previous config saved to /var/cache/conftool/dbconfig/20250417-195506-fceratto.json [19:58:52] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1181.eqiad.wmnet with reason: host reimage [19:58:58] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610 [19:59:01] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [19:59:04] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610 [19:59:18] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 
nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610 [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250417T2000). [20:00:05] robertsky: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T391056)', diff saved to https://phabricator.wikimedia.org/P75230 and previous config saved to /var/cache/conftool/dbconfig/20250417-200008-fceratto.json [20:00:12] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:00:29] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2058 to cirrussearch2058 [20:00:40] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:02:11] I am around. 
[20:02:28] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1181.eqiad.wmnet with reason: host reimage [20:05:33] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2058 to cirrussearch2058 - bking@cumin2002" [20:06:13] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2058 to cirrussearch2058 - bking@cumin2002" [20:06:13] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:06:14] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2058 [20:06:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10753771 (10VRiley-WMF) Currently, an-worker1180 and an-worker1181 are finished and ready. Still working on the rest though. 
[20:06:28] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2058 [20:07:08] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2058 to cirrussearch2058 [20:07:08] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2058.codfw.wmnet on all recursors [20:07:12] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2058.codfw.wmnet on all recursors [20:08:53] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2058.codfw.wmnet with OS bullseye [20:09:06] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2058 [20:09:46] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1178.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:10:27] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1178.eqiad.wmnet with OS bullseye [20:10:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10753781 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1178.eqiad.wmnet with OS b... 
[20:12:09] bking@cumin2002 rolling-operation (PID 1980722) is awaiting input [20:12:17] (03PS5) 10RLazarus: Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127087 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [20:13:39] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:15:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P75231 and previous config saved to /var/cache/conftool/dbconfig/20250417-201515-fceratto.json [20:18:12] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2058 - bking@cumin2002" [20:18:17] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2058 - bking@cumin2002" [20:18:17] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:18:17] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2058.codfw.wmnet 205.16.192.10.in-addr.arpa 5.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [20:18:21] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2058.codfw.wmnet 205.16.192.10.in-addr.arpa 5.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [20:18:22] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2058 [20:18:36] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2058 [20:18:37] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2058 [20:23:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that 
power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:25:13] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1178.eqiad.wmnet with reason: host reimage [20:25:17] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1178.eqiad.wmnet with reason: host reimage [20:25:27] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1181.eqiad.wmnet with OS bullseye [20:25:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10753792 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1181.eqiad.wmnet with OS bulls... [20:26:20] (03CR) 10Chlod Alejandro: [C:03+1] wikimaniawiki: update logo to 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137349 (https://phabricator.wikimedia.org/T392239) (owner: 10Robertsky) [20:27:34] anyone around to run the backport window? [20:30:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P75232 and previous config saved to /var/cache/conftool/dbconfig/20250417-203021-fceratto.json [20:31:06] 06SRE-OnFire, 10Incident Tooling, 13Patch-For-Review: Incident documents are less visible with Corto - https://phabricator.wikimedia.org/T390126#10753805 (10jhathaway) @Eevans based on some help from ITS I was able to get the root of the issue, patch above for you review. 
[20:34:21] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2058.codfw.wmnet with reason: host reimage [20:37:20] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1183.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:38:05] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2058.codfw.wmnet with reason: host reimage [20:45:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T391056)', diff saved to https://phabricator.wikimedia.org/P75233 and previous config saved to /var/cache/conftool/dbconfig/20250417-204528-fceratto.json [20:45:33] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:45:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1249.eqiad.wmnet with reason: Maintenance [20:45:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T391056)', diff saved to https://phabricator.wikimedia.org/P75234 and previous config saved to /var/cache/conftool/dbconfig/20250417-204552-fceratto.json [20:50:00] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1178.eqiad.wmnet with OS bullseye [20:50:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10753841 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1178.eqiad.wmnet with OS bulls... 
[20:55:21] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1183.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:56:06] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1183.eqiad.wmnet with OS bullseye [20:56:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10753858 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1183.eqiad.wmnet with OS b... [20:58:39] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250417T2100) [21:03:39] FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:10] (03PS1) 10Ryan Kemper: fix inconsequential typos [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137356 [21:05:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2058.codfw.wmnet with OS bullseye [21:07:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:08:22] bking@cumin2002 rolling-operation (PID 1980722) is awaiting input [21:08:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10753886 (10VRiley-WMF) An-worker1178 is fully completed [21:08:35] (03PS2) 10Ryan Kemper: fix inconsequential typos [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137356 [21:09:05] (03CR) 10Bking: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137356 (owner: 10Ryan Kemper) [21:11:51] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1183.eqiad.wmnet with reason: host reimage [21:12:39] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2078 to cirrussearch2078 [21:12:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:13:02] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:15:35] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1183.eqiad.wmnet with reason: host reimage [21:16:53] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:17:20] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2078 to cirrussearch2078 - bking@cumin2002" [21:17:37] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera 
data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2078 to cirrussearch2078 - bking@cumin2002" [21:17:37] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:17:38] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2078 [21:17:53] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2078 [21:18:34] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2078 to cirrussearch2078 [21:18:34] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2078.codfw.wmnet on all recursors [21:18:38] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2078.codfw.wmnet on all recursors [21:18:51] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye [21:19:03] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2078 [21:19:11] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.move-vlan (exit_code=99) for host cirrussearch2078 [21:19:12] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2078.codfw.wmnet with OS bullseye [21:19:39] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2097 to cirrussearch2097 [21:20:01] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:25:32] bking@cumin2002 rolling-operation (PID 1980722) is awaiting input [21:28:54] (03PS1) 10Dzahn: aptrepo: add thirdparty/ci component to bookworm-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1137361 (https://phabricator.wikimedia.org/T392127) [21:29:42] (03CR) 10Dzahn: [C:03+2] "needs a follow-up. 
jenkins package is added to list of upgrade-able packages but references thirdparty/ci component which is not yet in di" [puppet] - 10https://gerrit.wikimedia.org/r/1137060 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [21:32:25] (03CR) 10Dzahn: gerrit: switchover to gerrit2002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [21:33:06] (03CR) 10Dzahn: "being bold and fixing commit message to (switchover to gerrit1003,not gerrit2002)" [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [21:33:16] (03PS2) 10Dzahn: gerrit: switchover to gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [21:35:29] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:35:49] (03CR) 10Dzahn: "So is the intention to revert the last switchover? This looks like it would simultaneously change what the active_host is (back to eqiad) " [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [21:37:56] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1183.eqiad.wmnet with OS bullseye [21:38:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10753955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1183.eqiad.wmnet with OS bulls... [21:39:08] (03CR) 10Dzahn: "This looks good for a clean revert between (existing/old) gerrit and gerrit-replica (1003 vs 2002). 
But it does not match the other patch" [dns] - 10https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [21:40:11] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2097 to cirrussearch2097 - bking@cumin2002" [21:40:16] (03CR) 10Dzahn: "btw, you can link between gerrit patches by copy/pasting a little bit of the beginning of a change-id. example: I95b7e2e919" [dns] - 10https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [21:41:54] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2097 to cirrussearch2097 - bking@cumin2002" [21:41:54] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:41:55] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2097 [21:42:05] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2097 [21:42:45] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2097 to cirrussearch2097 [21:42:46] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2097.codfw.wmnet on all recursors [21:42:49] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2097.codfw.wmnet on all recursors [21:45:50] bking@cumin2002 rolling-operation (PID 1980722) is awaiting input [21:46:01] (03CR) 10Dzahn: "So the idea here was that _because_ the gerrit replica is also a production host (can't be taken down anytime, has its own users) it can't" [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [21:46:02] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host 
cirrussearch2078.codfw.wmnet with OS bullseye [21:46:06] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2078 [21:46:07] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2078 [21:46:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T391056)', diff saved to https://phabricator.wikimedia.org/P75235 and previous config saved to /var/cache/conftool/dbconfig/20250417-214610-fceratto.json [21:46:14] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [21:50:11] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2097.codfw.wmnet with OS bullseye [21:50:23] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2097 [21:51:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10753983 (10VRiley-WMF) an-worker1183 is fully completed [21:52:38] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1184.eqiad.wmnet with OS bullseye [21:52:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10753984 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1184.eqiad.wmnet with OS b... 
[21:53:26] bking@cumin2002 rolling-operation (PID 1980722) is awaiting input [21:53:40] FIRING: [2x] ProbeDown: Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:54:16] bking@cumin2002 reimage (PID 2090647) is awaiting input [21:54:38] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:57:28] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2078.codfw.wmnet with OS bullseye [21:58:35] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye [21:58:40] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2078 [21:58:40] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2078 [22:01:05] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2097 - bking@cumin2002" [22:01:11] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2097 - bking@cumin2002" [22:01:11] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:01:11] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2097.codfw.wmnet 234.16.192.10.in-addr.arpa 4.3.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [22:01:15] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2097.codfw.wmnet 234.16.192.10.in-addr.arpa 4.3.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [22:01:16] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces 
for host cirrussearch2097 [22:01:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P75236 and previous config saved to /var/cache/conftool/dbconfig/20250417-220116-fceratto.json [22:01:28] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2097 [22:01:29] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2097 [22:02:54] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2078.codfw.wmnet with OS bullseye [22:03:42] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2078.codfw.wmnet'] [22:04:46] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cirrussearch2078.codfw.wmnet'] [22:04:55] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2078.codfw.wmnet'] [22:10:57] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1184.eqiad.wmnet with reason: host reimage [22:13:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cirrussearch2078.codfw.wmnet'] [22:14:09] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1184.eqiad.wmnet with reason: host reimage [22:15:33] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye [22:15:37] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2078 [22:15:37] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2078 [22:16:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to 
https://phabricator.wikimedia.org/P75237 and previous config saved to /var/cache/conftool/dbconfig/20250417-221623-fceratto.json [22:18:33] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2097.codfw.wmnet with reason: host reimage [22:19:43] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2078.codfw.wmnet with OS bullseye [22:20:16] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye [22:20:20] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2078 [22:20:20] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2078 [22:22:14] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2097.codfw.wmnet with reason: host reimage [22:27:08] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:28:36] PROBLEM - Host ncmonitor1001 is DOWN: PING CRITICAL - Packet loss = 100% [22:28:36] RECOVERY - Host ncmonitor1001 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [22:30:48] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [22:31:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T391056)', diff saved to https://phabricator.wikimedia.org/P75238 and previous config saved to /var/cache/conftool/dbconfig/20250417-223130-fceratto.json [22:31:35] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [22:31:46] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1252.eqiad.wmnet with reason: Maintenance [22:33:54] vriley@cumin1002 reimage (PID 678441) 
is awaiting input [22:35:25] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:35:27] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:39:51] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2147.codfw.wmnet with reason: Maintenance [22:39:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T391056)', diff saved to https://phabricator.wikimedia.org/P75239 and previous config saved to /var/cache/conftool/dbconfig/20250417-223957-fceratto.json [22:40:01] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [22:46:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T391056)', diff saved to https://phabricator.wikimedia.org/P75240 and previous config saved to /var/cache/conftool/dbconfig/20250417-224611-fceratto.json [22:46:16] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [22:48:28] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2097.codfw.wmnet with OS bullseye [22:51:35] (03CR) 10Dzahn: "what's the status of this in beta?" [puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [22:52:45] (03CR) 10Ahmon Dancy: [C:04-1] "The last time I tried using this, the service kept trying to run on the host where I didn't want them to run (deployment-deploy04). I hav" [puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [22:54:31] (03CR) 10Dzahn: "gotcha! 
no rush. I asked mostly based on the 04/16 comment that added the beta-cherry-picked hashtag" [puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [22:58:37] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610 [22:58:40] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [22:59:49] (03CR) 10Dzahn: "I am not sure about the expectation here. I don't want to self merge varnish frontend changes, I do have a positive comment but I think I " [puppet] - 10https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [23:00:38] (03CR) 10Dzahn: "I wonder why Gerrit says it's my turn to review on my own change with no pending comments.. hmm" [puppet] - 10https://gerrit.wikimedia.org/r/1137329 (owner: 10Dzahn) [23:01:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P75241 and previous config saved to /var/cache/conftool/dbconfig/20250417-230118-fceratto.json [23:16:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P75242 and previous config saved to /var/cache/conftool/dbconfig/20250417-231625-fceratto.json [23:31:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T391056)', diff saved to https://phabricator.wikimedia.org/P75243 and previous config saved to /var/cache/conftool/dbconfig/20250417-233131-fceratto.json [23:31:36] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [23:31:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
db2155.codfw.wmnet with reason: Maintenance [23:32:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2187.codfw.wmnet with reason: Maintenance [23:32:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T391056)', diff saved to https://phabricator.wikimedia.org/P75244 and previous config saved to /var/cache/conftool/dbconfig/20250417-233211-fceratto.json [23:35:40] (03CR) 10BryanDavis: "This one would work in the config files, but I don't have a strong opinion on if it /should/ be used." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) (owner: 10BryanDavis) [23:38:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T391056)', diff saved to https://phabricator.wikimedia.org/P75245 and previous config saved to /var/cache/conftool/dbconfig/20250417-233825-fceratto.json [23:38:29] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [23:40:41] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2078.codfw.wmnet with OS bullseye [23:42:22] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1137372 [23:42:22] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1137372 (owner: 10TrainBranchBot) [23:43:26] (03CR) 10BryanDavis: "My only concern with this is that people used to get confused and assume that `labswiki` refers to the Beta Cluster. 
(Naming is hard and a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [23:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:52:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:53:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P75246 and previous config saved to /var/cache/conftool/dbconfig/20250417-235331-fceratto.json [23:53:40] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:54:02] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1137372 (owner: 10TrainBranchBot)